#https://tinyurl.com/ANLPTutorial2Part1
## Go to "File" -> "Save a Copy in Drive..."

This creates your own copy of the notebook in your Google Drive, so any changes you make don't affect the shared notebook.

## Text analysis using Python - Part 1

The first step is to install the required libraries using the pip command (if you don't have them), and import the modules from the libraries.



In [None]:
#Enable plots to be displayed in the notebook
%matplotlib inline

!pip install seaborn

import pandas as pd
import seaborn as sns
import nltk
from nltk.tokenize import word_tokenize
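nltk.download('punkt')
# Recent NLTK releases load the tokenizer data from 'punkt_tab' instead of 'punkt'
nltk.download('punkt_tab')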

## Mounting the drive

In this notebook, I mount Google Drive to read a CSV file that is stored there. You must allow access to your Drive by signing in to your Google account.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Download the dataset from here: https://drive.google.com/file/d/1qAC9x-WMwzofyG8l1fFUoHwWhk3n61fp/view?usp=sharing

Then copy it to the Google Drive folder that contains the notebook.

In [None]:
# After executing the cell above, Drive files will be present in "/content/drive/My Drive". The below command lists the contents in the drive:
!ls "/content/drive/My Drive/Colab_Notebooks/ANLP"

## Reading Data from a CSV File

To read the data from the input CSV file on my Google Drive and store it as a pandas DataFrame, I use the read_csv() function from pandas. Change the folder location to wherever the file is stored in your own Drive; mine is at this path:
/content/drive/My Drive/Colab_Notebooks/ANLP/CNN_Articles_2021-2023.csv

You can read about the different functions and their input parameters in the documentation for the library:
[Pandas Documentation](http://pandas.pydata.org/pandas-docs/stable/)

**Note:** Comment out the code below if you are not importing from your Google Drive folder.

 

In [None]:
#The input csv file is a subset of the data from https://github.com/hadasu/CNN_web_crawler
newsdf = pd.read_csv('/content/drive/My Drive/Colab_Notebooks/ANLP/CNN_Articles_2021-2023.csv')

## Reading input file from a url
The alternative is to read the CSV from a web URL (on GitHub) and store it in a DataFrame. This is a smaller dataset containing only articles from January to March 2021.


In [None]:
url = 'https://github.com/AntonetteShibani/NLPAnalysis/blob/main/CNN_Articles_2021.csv?raw=true'
newsdf = pd.read_csv(url)

## Preliminary data inspection

We usually try to get a sense of the data first (particularly useful for large datasets that are not easy to open in other UI-based tools).

In [None]:
#Print general information about a DataFrame including the index dtype and columns, non-null values and memory usage
newsdf.info()

In [None]:
newsdf.rename(columns={'Unnamed: 0': 'ID'}, inplace=True)
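
An 'Unnamed: 0' column typically appears when a CSV file was saved together with its row index, which leaves the first header cell empty. The sketch below reproduces this with a small made-up CSV (the two rows are hypothetical, not taken from the CNN dataset) and applies the same rename as the cell above.

```python
import io
import pandas as pd

# Hypothetical two-row CSV whose first header cell is empty,
# as happens when a DataFrame is saved together with its index
csv_text = ",headline,text\n0,First headline,Some text\n1,Second headline,More text\n"

demo_df = pd.read_csv(io.StringIO(csv_text))
original_columns = demo_df.columns.tolist()
print(original_columns)  # the empty header cell shows up as 'Unnamed: 0'

# Give the column a meaningful name
demo_df = demo_df.rename(columns={"Unnamed: 0": "ID"})
renamed_columns = demo_df.columns.tolist()
print(renamed_columns)
```

Assigning the result of rename() back to the variable has the same effect here as inplace=True.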

In [None]:
#Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution
newsdf.describe()

In [None]:
# Use the .head(n) function to look at the first 'n' rows of our news dataframe. The default n is 5, we are now changing it to view the first 10 rows
newsdf.head(10)


In [None]:
#A function similar to above, but provides a random sample of rows rather than the first few. 
newsdf.sample(5)

## Word Count

Word counts are simple but useful indicators for answering questions about the length of texts.

To demonstrate usage, we see how the metrics are calculated for one sample sentence from the dataset. 

In [None]:
s = newsdf['headline'][2]
print(s)

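#Splitting by whitespace characters and calculating the length. Note that punctuation attached to a word stays part of that word, while a standalone punctuation mark (e.g. a dash surrounded by spaces) is counted as a word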
len(s.split())
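
To make the effect of whitespace splitting on punctuation concrete, here is a minimal sketch using a made-up headline (hypothetical, not taken from the dataset).

```python
# Hypothetical headline used only for illustration
headline = "Markets rally - but for how long?"

tokens = headline.split()
print(tokens)
# The standalone dash counts as its own "word",
# while the question mark stays attached to "long?"
print(len(tokens))
```

If punctuation should not influence the count, it has to be separated or removed before counting, which is what the tokenising and cleaning steps later in this notebook do.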

In [None]:
#To make it easier to reuse in the future, we can create a function that returns word count
def word_count(text):
    # Missing values (NaN) are not strings, so count them as 0 words
    if not isinstance(text, str):
        return 0
    return len(text.split())

Now we can apply the word_count function to our text column to create a new column with the number of words in each news article.

In [None]:
newsdf['article_word_count'] = newsdf['text'].apply(word_count)

We can use the describe, hist, and boxplot functions to get some information on the length of articles in our dataset.

In [None]:
newsdf['article_word_count'].describe()

In [None]:
newsdf['article_word_count'].hist(bins = 10)

In [None]:
sns.boxplot(x="part_of",
            y="article_word_count",
            data=newsdf);

In [None]:
#I'm using a function that populates a bar graph from a dataframe column
import matplotlib.pyplot as plt
from collections import Counter
from nltk.corpus import stopwords
nltk.download('stopwords')  # the stop word list must be downloaded before its first use

def wordBarGraphFunction(df, column, title, top_n=50):
    # Lowercase and split every string value in the column into words
    topic_words = [z.lower() for x in df[column] if isinstance(x, str) for z in x.split()]
    # Build the stop word set once instead of reloading the list for every word
    stop_words_set = set(stopwords.words("english"))
    counts = Counter(w for w in topic_words if w not in stop_words_set)
    # Most frequent words, reversed so that the top word is drawn at the top of the chart
    top = counts.most_common(top_n)[::-1]
    # Use len(top) rather than a fixed 50 so the plot also works with fewer distinct words
    plt.barh(range(len(top)), [c for _, c in top])
    plt.yticks(range(len(top)), [w for w, _ in top])
    plt.title(title)
    plt.show()

In [None]:
plt.figure(figsize=(10,10))
wordBarGraphFunction(newsdf,'headline',"Most frequent words in news article headlines (Jan-Mar 2021)")

We can further explore the longest and shortest articles.

In [None]:
#shortest
newsdf.sort_values(by='article_word_count').head(10)

In [None]:
#longest
newsdf.sort_values(by='article_word_count', ascending=False).head(10)

You can then examine the content of individual articles to gain additional insight as needed.

## Word frequencies

Counting word frequencies (how often each word occurs) is a critical step in quantifying texts for many kinds of text analysis. Python libraries provide built-in functions that compute word frequencies.

Note that this analysis disregards the word order in the original sentence, taking a bag-of-words approach.
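
As a small illustration of the bag-of-words idea, the sketch below counts words in two made-up sentences (hypothetical examples) that contain the same words in a different order.

```python
from collections import Counter

# Two hypothetical sentences with the same words in a different order
sentence_a = "the dog chased the cat"
sentence_b = "the cat chased the dog"

bag_a = Counter(sentence_a.split())
bag_b = Counter(sentence_b.split())

print(bag_a)
# The counts are identical although the sentences mean different things
print(bag_a == bag_b)
```

Because only the counts are kept, who chased whom is lost; this is the trade-off of a bag-of-words representation.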


Calculate frequencies to determine the most common word in the corpus

In [None]:
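# joining all article texts into one string (missing values are skipped);
# unlike to_string(), this keeps the index labels out of the text
article_text = ' '.join(newsdf['text'].dropna().astype(str))

#create word tokens
tokenized_words = word_tokenize(article_text)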

In [None]:
all_words=nltk.FreqDist(tokenized_words)
all_words.plot(10);
print(all_words.most_common(20))

Create a word cloud to show most common words in the article text.

Note: There are many ways to customise word clouds for display. Check out the documentation and read related blog posts to try different combinations.

In [None]:
from wordcloud import WordCloud
wordcloud = WordCloud(max_words=100).generate(article_text)

import matplotlib.pyplot as plt
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

You will notice that the most frequent terms are stopwords and punctuation marks. Let's recalculate the frequencies after performing some basic cleaning.

In [None]:
# converting article text to lowercase as string comparisons in Python are case-sensitive
article_text_lower = article_text.lower()

#create word tokens
tokenized_words=word_tokenize(article_text_lower)

#Set up stop words for removal
nltk.download('stopwords')
from nltk.corpus import stopwords
#stopwords
stop_words=stopwords.words("english")
print(stop_words)
#Add custom stopwords to the list
stop_words.extend(["cnn", "'s", "a", "the"])

In [None]:
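#Create a new variable to store filtered tokens
#A set makes the membership test much faster than a list on a large corpus
stop_words_set = set(stop_words)
filtered_tokens=[]
for w in tokenized_words:
    if w not in stop_words_set:
        #add all filtered tokens excluding stopwords in this list below
        filtered_tokens.append(w)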

import string
# punctuations
punctuations=list(string.punctuation)
#Add custom punctuations to the list
punctuations.append("...")

#Create another variable to store all clean tokens
filtered_tokens_clean=[]
for i in filtered_tokens:
    if i not in punctuations:
        filtered_tokens_clean.append(i)
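
The same cleaning steps can be tried on a single sentence. The sketch below is a simplified stand-in: it uses a tiny hypothetical stop word set and plain whitespace splitting instead of NLTK's stop word list and word_tokenize, and the sentence is made up.

```python
import string
from collections import Counter

# Hypothetical mini stop word set; the notebook uses NLTK's much longer English list
toy_stop_words = {"the", "a", "is", "of", "and", "in"}
punctuation_marks = set(string.punctuation)

text = "The mayor of the city said the city is open, and the park is open."

# Put spaces around punctuation so each mark becomes its own token
for mark in ",.":
    text = text.replace(mark, f" {mark} ")

tokens = text.lower().split()
clean_tokens = [t for t in tokens
                if t not in toy_stop_words and t not in punctuation_marks]

print(clean_tokens)
print(Counter(clean_tokens).most_common(2))
```

After cleaning, the content words ("city", "open") surface at the top instead of "the".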

Now that we have cleaned the input text, let's calculate frequencies again to view the most common words.

In [None]:
all_words=nltk.FreqDist(filtered_tokens_clean)
all_words.plot(10);
print(all_words.most_common(20))

Exercise: What are the insights from here? What do the key words indicate?

## Collocations
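It is also quite useful to find the most common words that co-occur. Sequences of two or three adjacent words are known as bigrams and trigrams. Collocations are the subset of these n-grams whose words occur together more often than chance would predict, which makes them more meaningful than raw n-grams.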


In [None]:
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()
fourgram_measures = nltk.collocations.QuadgramAssocMeasures()
finder = BigramCollocationFinder.from_words(filtered_tokens_clean)

#Using PMI scores to quantify and rank the BiGrams
finder.nbest(bigram_measures.pmi, 50)
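
To see what a PMI score expresses, the sketch below computes it by hand for a made-up token list. It assumes the common definition PMI = log2(count(w1, w2) * N / (count(w1) * count(w2))) with N the number of tokens; the helper name pmi and the token list are hypothetical, and this is an illustration rather than NLTK's own code.

```python
import math
from collections import Counter

# Hypothetical token list used only for illustration
tokens = "new york mayor visits new york city park near city hall".split()

word_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
total_words = len(tokens)

def pmi(bigram):
    """Pointwise mutual information of an adjacent word pair (assumed definition)."""
    w1, w2 = bigram
    joint = bigram_counts[bigram]
    return math.log2(joint * total_words / (word_counts[w1] * word_counts[w2]))

for bigram in [("new", "york"), ("park", "near")]:
    print(bigram, "count:", bigram_counts[bigram], "PMI:", round(pmi(bigram), 3))
```

In this toy example the pair that occurs only once ("park", "near") scores higher than the pair that occurs twice ("new", "york"), because its words never appear apart. PMI tends to favour rare pairs in this way, so a minimum frequency filter such as finder.apply_freq_filter(3) is often applied before ranking.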

## Word associations

Let's see what the most associated words in our text are. You can repeat the same using trigrams.

In [None]:
from nltk import BigramAssocMeasures
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(filtered_tokens_clean)

finder.nbest(bigram_measures.likelihood_ratio, 20)


We can also find words associated with specific words of interest. Let's do this using trigrams in NLTK 


In [None]:
trigram_measures = nltk.collocations.TrigramAssocMeasures()

# Filter function: returns True for n-grams that do NOT contain 'city';
# apply_ngram_filter() removes every n-gram for which the function returns True
city_filter = lambda *w: 'city' not in w

finder = TrigramCollocationFinder.from_words(filtered_tokens_clean)

# only trigrams that appear 3+ times
finder.apply_freq_filter(3)
# only trigrams that contain 'city'
finder.apply_ngram_filter(city_filter)

# print every remaining trigram with its likelihood-ratio score, highest score first
# (alternative: finder.nbest(trigram_measures.likelihood_ratio, 10) returns only the top 10)
for i in finder.score_ngrams(trigram_measures.likelihood_ratio):
    print(i)
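
The direction of the filter is easy to get wrong: apply_ngram_filter() removes the n-grams for which the function returns True, so the function has to describe what to discard. The plain-Python sketch below mimics that behaviour on three made-up trigrams (hypothetical data, no NLTK needed).

```python
# Same filter as in the cell above: True means "no 'city' in this n-gram"
city_filter = lambda *w: 'city' not in w

# Hypothetical trigrams used only for illustration
trigrams = [
    ("new", "york", "city"),
    ("the", "white", "house"),
    ("city", "council", "vote"),
]

# Keep a trigram only when the filter does NOT flag it for removal
kept = [t for t in trigrams if not city_filter(*t)]
print(kept)
```

Only the trigrams containing 'city' survive, which matches the intent of the filter.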

Try it out: Visualise the key associations of words in a graph format

https://lyonwj.com/blog/nlp-with-neo4j


## Concordances

We can further look up the locations at which a given word occurs in the news articles using a concordance analysis.


In [None]:
from nltk.text import Text  
textlist = Text(filtered_tokens_clean)
print(textlist)
textlist.concordance('city')
textlist.concordance("city", width=100, lines=10)
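
A concordance is essentially a keyword-in-context listing. The minimal sketch below builds one by hand for a made-up token list; the function name simple_concordance and its window parameter are hypothetical and only approximate what NLTK's concordance() displays.

```python
def simple_concordance(tokens, target, window=2):
    """Return each occurrence of target with `window` tokens of context on both sides."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Hypothetical tokens used only for illustration
tokens = "the city council met today while the city park stayed closed".split()

concordance_lines = simple_concordance(tokens, "city")
for line in concordance_lines:
    print(line)
```

Note that the notebook runs the concordance on the cleaned tokens, so the context shown there no longer contains stopwords or punctuation.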

## Regular expressions

A **regular expression** (or RE) is used to match strings of text such as particular characters, words, or patterns of characters. These come in quite handy for a number of operations in string manipulation. For instance, we can extract the name from an email address, the title from a name, the subject code from a text description, or the components of an address.

There are commonly used wild card patterns in Python that help us extract useful information from texts:
^

This wild card matches the characters at the beginning of a line.

$

This wild card matches the characters at the end of the line.

.

This wild card matches any single character except a newline.

\s

This wild card matches a whitespace character (space, tab, or newline).

\S

This wild card matches a non-whitespace character.

\d

This wild card matches one digit.

*

This wild card repeats any preceding character zero or more times. It matches the longest possible string.

*?

This wild card also repeats any preceding character/characters zero or more times. However, it matches the shortest string following the pattern.

+

This wild card repeats any preceding character one or more times. It matches the longest possible string following the pattern.

+?

This wild card repeats any preceding character one or more times. However, it matches the shortest possible string following the pattern.

[aeiou]

It matches any character from a set of given characters.

[^XYZ]

It matches any character not given in the set.

[a-z0-9]

It matches any single character in the range a-z or 0-9.

(

This wild card marks the start of a capture group, i.e. the part of the match that is extracted.

)

This wild card marks the end of the capture group.
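
A few of these patterns in action, as a minimal sketch on a made-up string (hypothetical data). It uses Python's standard re module, which is assumed to behave the same as the regex package for these basic patterns.

```python
import re

# Hypothetical sample string used only for illustration
sample = "Order <A17> shipped to <B204> on 2023-05-01"

greedy = re.findall(r"<.*>", sample)           # * takes the longest possible match
lazy = re.findall(r"<.*?>", sample)            # *? takes the shortest possible match
digits = re.findall(r"\d+", sample)            # one or more digits
letters = re.findall(r"<([A-Z])\d+>", sample)  # ( ) extracts only the letter
first_word = re.findall(r"^\S+", sample)       # non-whitespace run at the start

print(greedy)      # ['<A17> shipped to <B204>']
print(lazy)        # ['<A17>', '<B204>']
print(digits)      # ['17', '204', '2023', '05', '01']
print(letters)     # ['A', 'B']
print(first_word)  # ['Order']
```

The difference between the first two results shows why the non-greedy variants matter when a delimiter occurs more than once in a line.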


Read examples of applications here: https://code.tutsplus.com/tutorials/8-regular-expressions-you-should-know--net-6149, and more examples here: https://developers.google.com/edu/python/regular-expressions

In [None]:
!pip install regex
import regex as re

test_string = '''Guidelines for using ChatGPT in 36118: ANLP 2023
To access ChatGPT, anyone can sign up for a free account at https://chat.openai.com/auth/login

Key limitations of the technology
It’s important to recognise that whilst AI tools enable us to do some amazing things, they have their limitations. Unlike search engines that provide you with options that match your query, ChatGPT gives you a definitive answer that may limit your own research abilities. Google has already entered this space, and there will be more to follow including the integration of GPT-3 in everyday tools such as Microsoft Teams. 
ChatGPT is a language model, not a knowledge base. 
One way to think about it is that it is a very sophisticated autocorrect. It has the capacity to generate text by anticipating what words should follow, creating cohesive and confident sounding sentences. However, it does not (and can not) validate the accuracy of the information it provides. These confident, but incorrect statements have been termed ‘hallucinations. In more problematic instances, it has even been shown to gaslight users. This subject requires you to do your own research towards topics of your interest, and ensure that information provided is correct.
ChatGPT can’t recognise its sources. 
This means it can’t reference where the information is coming from. It has also been shown to make up false references and citations that do not exist in order to sound correct. As a language model, its objective is to present plausible sentences, not accurate information. This subject requires you to reference your research to ensure your understanding is founded on credible information.
Limited data.
ChatGPT has only been trained on data up until 2021. This means it does not have access to any up to date information, nor can it access new information as it is not connected to the internet. Whilst some argue that AI presents a paradigm shift for new creative outputs (WIRED), others argue that the limitations of AI data is a dangerous constraint on human imagination (Horning, 2022). This subject requires you to research and explore up to date trends and emerging signals, as well as challenging you to push your creativity.
AI data has inherent (and often problematic) biases.
An ongoing issue in the creation and implementation of AI tools is the inherent biases from training data. The reflection and reproduction of assumptions and biases has led to highly problematic results, like racist and sexist claims. This subject requires you to examine personal biases and critically reflect upon values in the futures you propose.
Generative AI tools might have violated privacy laws.
Large language models in GPT-3 and similar tools have obtained data from the web for training purposes, not all of which are openly available for public use. It’s been called a ‘privacy nightmare’ as the tool has over 300 billion words scraped from the internet: books, articles, websites and posts – including personal information obtained without consent. Other generative AI tools that generate images have also been under scrutiny as they face lawsuits for using Copyrighted works of art.
ChatGPT is a proprietary tool
ChatGPT is owned by OpenAI, which has control of the tool and your user data. The source code and the underlying models are not open sourced, meaning we do not know how the model actually works. The openly available version is a free research preview which does not guarantee access (You may need to pay for this in the future). Alternate open source models and tools are currently being developed - you may contribute to them!
'''
print(test_string)


In the example below, we extract all words that start with the letter 'C'

In [None]:
# \b anchors the match at a word boundary, so a 'C' in the middle of a word is not picked up
startswithC = re.findall(r'\b(C\w+)', test_string)

for txt in startswithC:
    print(txt)

In [None]:
#Note how they are case-sensitive
# \b anchors the match at a word boundary; without it, 'access' would yield 'ccess'
startswithC = re.findall(r'\b(c\w+)', test_string)

for txt in startswithC:
    print(txt)

Exercise: Can you try creating one that captures lower case or upper case characters?

In [None]:
#Extract URLS following a certain format
# A raw string (r"...") keeps Python from treating \s as a string escape sequence
print(re.search(r"(?P<url>https?://[^\s]+)", test_string).group("url"))

Let's write a function that collects and prints all matching texts, and test it out with RegEx patterns.

In [None]:
def find_with_regex(regex, text):
    matches = []
    # find all matching patterns 
    for group in regex.findall(text):
        matchingtext = ''.join(group)
        matches.append(matchingtext)
         
    print("All matching texts: ")
    print(matches)
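
The ''.join(group) step matters when a pattern contains more than one capture group, because findall() then returns tuples rather than strings. A minimal sketch with a made-up string (hypothetical data), using Python's standard re module:

```python
import re

# Hypothetical sample string used only for illustration
sample = "GPT-3 was followed by GPT-4"

# No capture group: findall returns the full matches as strings
full_matches = re.findall(r"[A-Z]+-\d", sample)
print(full_matches)   # ['GPT-3', 'GPT-4']

# Two capture groups: findall returns one tuple per match
pairs = re.findall(r"([A-Z]+)-(\d)", sample)
print(pairs)          # [('GPT', '3'), ('GPT', '4')]

# Joining each tuple gives one string per match, as in find_with_regex
joined = ["".join(p) for p in pairs]
print(joined)         # ['GPT3', 'GPT4']
```

For patterns without groups, joining a plain string leaves it unchanged, so the function works in both cases.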

In [None]:
#Extracting every single digit (use r'[0-9]+' to capture whole integers instead)
pattern = re.compile(r'[0-9]')
find_with_regex(pattern, test_string)

In [None]:
#Extracting integers with at least 4 and at most 7 digits; the lookarounds stop the match from being cut out of a longer run of digits
pattern = re.compile(r'(?<!\d)\d{4,7}(?!\d)')
find_with_regex(pattern, test_string)

Note: match() only matches if the pattern occurs at the start of the string. search() returns the first occurrence that matches the pattern anywhere in the string. findall() scans the whole string and returns all non-overlapping matches in a single step.
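
The difference between the three functions can be checked on a short made-up string (hypothetical data). The sketch uses Python's standard re module, which is assumed to behave the same as the regex package here.

```python
import re

# Hypothetical sample string used only for illustration
text = "ANLP 2023 follows ANLP 2022"

# match() looks only at the start of the string
match_year = re.match(r"\d{4}", text)
match_name = re.match(r"ANLP", text)
print(match_year)           # None, because the string does not start with digits
print(match_name.group())   # ANLP

# search() returns the first match anywhere in the string
first_year = re.search(r"\d{4}", text).group()
print(first_year)           # 2023

# findall() returns every non-overlapping match
all_years = re.findall(r"\d{4}", text)
print(all_years)            # ['2023', '2022']
```

Since match() and search() return None when nothing is found, it is worth checking the result before calling .group() on it.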


**Exercise:** Try writing your own RegEx that can capture citations in text E.g. (Horning, 2022)