In [3]:
# Importing relevant packages 

import pandas as pd
import re


# Advanced Social Data Science 2 (ASDS2) Exercises


## April 19: Overview and regular expressions

### 1: Thinking about text as data

Go to Kaggle’s database of text data sets here: https://www.kaggle.com/datasets?topic=nlpDatasets 

1. Find an interesting data set. (Try searching the data sets or playing around with the sorting rule in the top right). It doesn’t have to be social sciencey, just whatever looks interesting to you.
2. Describe the variables in the data. What’s there in addition to the text itself, if anything?
3. What’s a meaningful latent variable which might vary across the texts? (For example, ‘sentiment’ might plausibly vary across movie reviews).
4. Assume you could measure the latent variable from (3). How might that latent variable correlate with other properties of the units of the data? (These can be observed variables in the data, or other, unobserved properties).


### 2: Importing text data

1. The file mach.csv, available on the course Absalon page, contains part of Machiavelli’s The Prince, subdivided into 188 sections. Download it to your computer.
2. Import the file into Python using read_csv() from pandas 

In [4]:
# Importing Machiavelli file

mach = pd.read_csv("mach.csv")
mach = mach.rename(columns = {'Unnamed: 0': 'section'}) #Renaming unnamed column


3. Using the search function from Python’s re module (or a pandas equivalent), find out in which section(s) the following terms appear:
    - lion
    - flatterers
    - ccmnot

In [5]:
# Creating columns containing a boolean value for each section showing whether the term is present or not

# Series.str.contains() treats the term as a regular expression by default (like re.search()) and returns True or False for each section.
# The match is on substrings, so the result is also True if the term is part of a longer word, e.g. lioness or lionization

mach['lion'] = mach['text'].str.contains('lion') 
mach['flatterers'] = mach['text'].str.contains('flatterers')
mach['ccmnot'] = mach['text'].str.contains('ccmnot')


In [4]:
print('Sections containing the word \"lion\"')
print(mach['section'][mach.lion == True])

print('\nSections containing the word \"flatterers\"')
print(mach['section'][mach.flatterers == True])

print('\nSections containing the word \"ccmnot\"')
print(mach['section'][mach.ccmnot == True])

Sections containing the word "lion"
26     Mach_122.txt.content
27     Mach_123.txt.content
44     Mach_139.txt.content
47     Mach_141.txt.content
97     Mach_187.txt.content
112     Mach_30.txt.content
139     Mach_55.txt.content
166      Mach_8.txt.content
Name: section, dtype: object

Sections containing the word "flatterers"
74    Mach_166.txt.content
75    Mach_167.txt.content
76    Mach_168.txt.content
Name: section, dtype: object

Sections containing the word "ccmnot"
53    Mach_147.txt.content
Name: section, dtype: object
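
Because the search above matches substrings, some of the 'lion' hits may come from longer words. A minimal sketch of one way to restrict the match to the whole word, using the `\b` word-boundary anchor. The sample sentences are made up for illustration and are not taken from mach.csv.

```python
import re

# Hypothetical sample sentences (not from mach.csv)
samples = [
    "The lion cannot defend himself against traps.",
    "A lioness guards her cubs.",
    "The rebellion was crushed.",
]

# Plain substring search: matches 'lion' inside 'lioness' and 'rebellion' too
substring_hits = [bool(re.search('lion', s)) for s in samples]

# \b marks a word boundary, so only the standalone word 'lion' matches
whole_word_hits = [bool(re.search(r'\blion\b', s)) for s in samples]

print(substring_hits)   # [True, True, True]
print(whole_word_hits)  # [True, False, False]
```

The same pattern string can be passed to `str.contains`, for example `mach['text'].str.contains(r'\blion\b')`.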


4. Why might a nonsensical term like ‘ccmnot’ be in the corpus?

From investigating the section that contains 'ccmnot', it appears to be a transcription error for 'cannot'. Perhaps the Machiavelli text was digitised by scanning, and the character recognition misread the letters 'an' as 'cm'.

In [5]:
# Exploring the section containing 'ccmnot'. 

print(mach['text'][mach.ccmnot == True].values)

[' But let us return to our subject. I maintain that anyone who considers what I have written will realise that either hatred or contempt led to the downfall of the emperors I have discussed; he will recognise that some of them acted in one way and others in the opposite way, and that one ruler in each group was successful and the others ended badly. Because Pertinax and Alexander were new rulers, it was useless and harmful for them to act like Marcus, who was an hereditary ruler. Likewise, it was harmful for Caracalla, Commodus and Maximinus to act like Severus, because they lacked the ability required to follow in his footsteps. Therefore, a new ruler in a new principality ccmnot imitate the conduct of Marcus, nor again is it necessary to imitate that of Severus. Rather, he should imitate Severus in the courses of action that are necessary for establishing himself in power, and imitate Marcus in those that are necessary for maintaining power that is already established and secure, th

### 3: Regular expressions

In this exercise, we’re continuing with Python’s re module.
<br>The following can be solved using one or more of these three functions in re:
`search`
`split`
`sub`

Hint: Take a look at the documentation for Python's re module to find solutions, and test your regular expression on regextester.com or consult regex101.com 

1. Define a string and check that it contains only a certain set of characters (in this case a-z, A-Z and 0-9). 

In [6]:
def specific_char(string):
    # Regex search for any character that does NOT (^) match a-z, A-Z or 0-9
    disallowed = re.search(r'[^a-zA-Z0-9]', string)
    return not bool(disallowed)  # If any character outside the set is found, this returns False.

print(specific_char("ABCDEFabcdef123450")) 
print(specific_char("*&%@#!}{"))

True
False
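
The same check can be phrased positively: rather than searching for a forbidden character, require the whole string to consist of allowed ones. A minimal sketch using `re.fullmatch` (not one of the three functions listed for this exercise); the function name `only_alphanumeric` is made up for illustration.

```python
import re

def only_alphanumeric(string):
    # fullmatch succeeds only if the ENTIRE string matches the pattern
    return bool(re.fullmatch(r'[a-zA-Z0-9]+', string))

print(only_alphanumeric("ABCDEFabcdef123450"))  # True
print(only_alphanumeric("*&%@#!}{"))            # False
print(only_alphanumeric(""))                    # False
```

One difference to be aware of: because of the `+`, the empty string is rejected here, whereas the negated-set search accepts it (there is no forbidden character to find).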


2. Define a string and check that it has an _a_ followed by zero or more _b_'s.

In [7]:
def ab_match(text):
    if re.search(r'ab*', text): # * matches the previous token (b) between zero and unlimited times
        return 'Found a match!'
    else:
        return 'Not matched!'

print(ab_match("ac"))
print(ab_match("abc"))
print(ab_match("abbc"))
print(ab_match("bbc"))

Found a match!
Found a match!
Found a match!
Not matched!
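
For a yes/no check, the greedy quantifier `*` and its lazy variant `*?` accept exactly the same strings. They differ only in how much text the match captures, which the following short sketch illustrates by inspecting the matched span with `.group()`.

```python
import re

text = "abbbc"

# Greedy: * takes as many b's as possible
greedy = re.search(r'ab*', text).group()

# Lazy: *? takes as few b's as possible (here: none)
lazy = re.search(r'ab*?', text).group()

print(greedy)  # abbb
print(lazy)    # a
```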


3. Define a string and check that it has an _a_ followed by one or more _b_'s.

In [8]:
def ab_match(text):
    if re.search(r'ab+', text): # + matches the previous token (b) between one and unlimited times
        return 'Found a match!'
    else:
        return 'Not matched!'

print(ab_match("ab"))
print(ab_match("abc"))
print(ab_match("acb"))

Found a match!
Found a match!
Not matched!


4. Using the sample string ‘The quick brown fox jumps over the lazy dog’, search for the words 'fox', 'dog', 'horse'.

In [9]:
patterns = ['fox', 'dog', 'horse']
text = 'The quick brown fox jumps over the lazy dog'

for pattern in patterns:
    print('Searching for "%s" in "%s" ->' % (pattern, text))
    if re.search(pattern, text):
        print('Matched!\n')
    else:
        print('Not Matched!\n')
        

Searching for "fox" in "The quick brown fox jumps over the lazy dog" ->
Matched!

Searching for "dog" in "The quick brown fox jumps over the lazy dog" ->
Matched!

Searching for "horse" in "The quick brown fox jumps over the lazy dog" ->
Not Matched!
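
Note that these searches are case-sensitive. As a small sketch, the `re.IGNORECASE` flag makes the same search ignore case; the upper-case pattern 'FOX' is chosen here purely for illustration.

```python
import re

text = 'The quick brown fox jumps over the lazy dog'

# Without a flag, 'FOX' does not match the lower-case 'fox'
case_sensitive = re.search('FOX', text)

# With re.IGNORECASE, upper and lower case are treated as equal
case_insensitive = re.search('FOX', text, flags=re.IGNORECASE)

print(case_sensitive)            # None
print(case_insensitive.group())  # fox
```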



5. Define a string with the word ‘Road’ in it, and abbreviate 'Road' as 'Rd.' using sub().

In [10]:
#Define a string with the word ‘Road’ in it, and abbreviate 'Road' as 'Rd.' using sub().

text = 'The quick brown fox jumps over the lazy dog on Hampton Road'

print(re.sub('Road', 'Rd.', text))


The quick brown fox jumps over the lazy dog on Hampton Rd.
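
A plain 'Road' pattern also rewrites longer words that happen to contain it. The sketch below, on a made-up sentence, shows how adding `\b` word boundaries limits the substitution to the standalone word.

```python
import re

# Hypothetical sentence containing both 'Roadside' and 'Road'
text = 'The Roadside cafe on Hampton Road'

# Substring pattern: also changes 'Roadside'
plain = re.sub('Road', 'Rd.', text)

# Word-boundary pattern: only the standalone word is replaced
whole_word = re.sub(r'\bRoad\b', 'Rd.', text)

print(plain)       # The Rd.side cafe on Hampton Rd.
print(whole_word)  # The Roadside cafe on Hampton Rd.
```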


6. Define a string and perform very simple tokenization by splitting at all whitespaces.

In [11]:
#Define a string and perform very simple tokenization by splitting at all whitespaces.

text = 'The quick brown fox jumps over the lazy dog.'

print(re.split(r'\s+', text)) #\s+ matches one or more whitespace characters (spaces, tabs, newlines)

#This can also be done easily without regex using text.split()

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog.']
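
A literal space pattern `' '` and the whitespace class `\s+` give the same tokens for the sentence above, but they diverge once the text contains tabs, newlines or repeated spaces. A short sketch on a made-up string:

```python
import re

# Hypothetical string with a tab, a double space and a newline
text = 'The quick\tbrown  fox\njumps'

# Splitting on a single literal space misses the tab and newline
# and yields an empty token between the two consecutive spaces
by_space = re.split(' ', text)

# \s+ treats any run of whitespace as one separator
by_whitespace = re.split(r'\s+', text)

print(by_space)       # ['The', 'quick\tbrown', '', 'fox\njumps']
print(by_whitespace)  # ['The', 'quick', 'brown', 'fox', 'jumps']
```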


7. Define a string and replace whitespaces with an underscore. Afterwards, reverse this by replacing underscores with a whitespace.

In [12]:
text = 'Python Exercises'
text1 = 'Python_Exercises'

print(text, '-->',re.sub(" ", "_", text))
print(text1, '-->', re.sub("_", " ", text1))

#Alternative solution, not using regex
print()
print('Alternative solution, same result')
print(text, '-->', text.replace(' ', '_'))


Python Exercises --> Python_Exercises
Python_Exercises --> Python Exercises

Alternative solution, same result
Python Exercises --> Python_Exercises


8. Define a string with a few cases of multiple spaces between words and remove all those cases.

In [13]:
text = 'Python      Exercises'
print("Original string:\t",text)
print("Without extra spaces:\t",re.sub(' +',' ',text)) #+ matches the previous token between one and unlimited times


Original string:	 Python      Exercises
Without extra spaces:	 Python Exercises
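
The pattern `' +'` handles runs of spaces only. If the text may also contain tabs or leading and trailing whitespace, one possible variant is sketched below on a made-up string, together with a non-regex alternative that gives the same result.

```python
import re

# Hypothetical string with leading/trailing spaces and a tab
text = '  Python \t  Exercises  '

# Collapse every run of whitespace to one space, then trim the ends
regex_result = re.sub(r'\s+', ' ', text).strip()

# Non-regex alternative: split on whitespace runs and re-join
split_result = ' '.join(text.split())

print(repr(regex_result))  # 'Python Exercises'
print(repr(split_result))  # 'Python Exercises'
```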
