#### PGGM Bootcamp Text Analytics 2020
*Notebook by [Pedro V Hernandez Serrano](https://github.com/pedrohserrano)*

---
![](images/1_1.png)

# 1.1 Basic Functions
* [1.1.1. Working with strings](#1.1.1)
* [1.1.2. Processing free-text](#1.1.2)
* [1.1.3. Reading and writing files](#1.1.3)

---

### 1.1.1 Working with strings
<a id="1.1.1"></a>

In [None]:
text1 = "The company ABN AMRO is a modern, full-service bank with a transparent and client-driven business model, a moderate risk profile"
text1

In [None]:
len(text1) # The length of text1

In [None]:
text2 = text1.split(' ') # Return a list of the words in text1, split on single spaces.

len(text2)

In [None]:
text2

<br>
List comprehensions allow us to filter for specific words:

In [None]:
[w for w in text2 if len(w) > 3] # Words in text2 that are longer than 3 characters

In [None]:
[w for w in text2 if w.istitle()] # Title-cased words in text2 (first letter uppercase, the rest lowercase)

In [None]:
[w for w in text2 if w.endswith('s')] # Words in text2 that end in 's'
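
The tokens produced by `split(' ')` still carry punctuation (for example `'modern,'`), which can affect filters such as `endswith`. A minimal sketch, assuming that trimming leading and trailing punctuation is acceptable for the task; the names `clean_tokens` and `long_lowercase` are hypothetical:

```python
import string

text1 = "The company ABN AMRO is a modern, full-service bank with a transparent and client-driven business model, a moderate risk profile"

# split() without arguments splits on any run of whitespace
tokens = text1.split()

# strip punctuation from both ends of each token; inner hyphens are kept
clean_tokens = [w.strip(string.punctuation) for w in tokens]

# several conditions can be combined in a single comprehension
long_lowercase = [w for w in clean_tokens if len(w) > 8 and w.islower()]
print(long_lowercase)
```

Stripping only the ends keeps compound words such as `'full-service'` intact.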

<br>
We can find unique words using `set()`.


In [None]:
text3 = 'The annual Report for 2019 showing annual results'
text4 = text3.split(' ')

len(text4)

In [None]:
len(set(text4))

In [None]:
set(text4)

In [None]:
len(set([w.lower() for w in text4])) # .lower converts the string to lowercase.

In [None]:
set([w.lower() for w in text4])
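
A set tells us which words occur but not how often. As a possible next step, `collections.Counter` from the standard library counts the occurrences of each word. This is only a sketch on the same sentence; the name `word_counts` is hypothetical:

```python
from collections import Counter

text3 = 'The annual Report for 2019 showing annual results'

# count lowercase tokens; Counter maps each word to its number of occurrences
word_counts = Counter(w.lower() for w in text3.split(' '))

print(len(word_counts))            # number of unique words
print(word_counts.most_common(2))  # most frequent words first
```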

---
### 1.1.2 Processing free-text
<a id="1.1.2"></a>

In [None]:
text5 = 'Between 29 December 2017 and 31 December 2018, ABN AMRO’s share price (depositary receipts) declined 24% while the XXXX index declined 28%'
text6 = text5.split(' ')

text6

<br>
Finding percentages:

In [None]:
[w for w in text6 if w.endswith('%')]

<br>
Finding web addresses (URLs):

In [None]:
text7 = '$NYT NEW ARTICLE : Can The New York Times Company Repeat Its Success in 2020? dashboard.stck.pro/news.php... Get all the latest @nytimes related news here : https://dashboard.stck.pro/news.php?ticker=NYT'
text8 = text7.split(' ')

In [None]:
[w for w in text8 if w.startswith('http')]

<br>

We can use regular expressions to help us with more complex parsing.

For example, `'@[A-Za-z0-9_]+'` matches an `'@'` followed by at least one character that is:
* a capital letter (`'A-Z'`)
* a lowercase letter (`'a-z'`)
* a digit (`'0-9'`)
* or an underscore (`'_'`)

Because `re.search` looks for the pattern anywhere in a word, the filter below keeps every word that contains such a match.


In [None]:
import re # import re - a module that provides support for regular expressions

[w for w in text8 if re.search('@[A-Za-z0-9_]+', w)]
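
Regular expressions can also be applied to the full text without splitting it first. The sketch below uses `re.findall` on the same example strings; the URL and percentage patterns are simple illustrative assumptions, not complete validators:

```python
import re

text5 = 'Between 29 December 2017 and 31 December 2018, ABN AMRO’s share price (depositary receipts) declined 24% while the XXXX index declined 28%'
text7 = '$NYT NEW ARTICLE : Can The New York Times Company Repeat Its Success in 2020? dashboard.stck.pro/news.php... Get all the latest @nytimes related news here : https://dashboard.stck.pro/news.php?ticker=NYT'

# '@' followed by one or more letters, digits or underscores
mentions = re.findall(r'@[A-Za-z0-9_]+', text7)

# 'http://' or 'https://' followed by any run of non-whitespace characters
urls = re.findall(r'https?://\S+', text7)

# an integer or decimal number immediately followed by '%'
percentages = re.findall(r'\d+(?:\.\d+)?%', text5)

print(mentions, urls, percentages)
```

Unlike the `endswith('%')` filter, the percentage pattern would still match a value that is followed by a comma or a closing bracket.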

---
#### *Learn more about Regex in Python at [Guru99.com](https://www.guru99.com/python-regular-expressions-complete-tutorial.html) and practice your skills at [https://regex101.com](https://regex101.com/)*

---
### 1.1.3 Reading and writing files
<a id="1.1.3"></a>

A fundamental part of working with text data is being able to read and write different file formats and file types.

There are many format options and encoding issues to consider. In the following notebooks we will use three approaches:  
**a. Binary store with Pickle**  
**b. Plain text files with Python**  
**c. Dataframes with Pandas**   

Advantages of pickling:

- Saving a program's state to disk so that it can carry on where it left off when restarted (persistence).

- Storing Python objects in a database.

- Converting an arbitrary Python object to a byte string so that it can be used as a dictionary key (e.g. for caching and memoization).

- The `pickle` module has a pure-Python implementation, which makes it easier to debug.
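
Before writing anything to disk, the conversion itself can be tried in memory. A minimal sketch using `pickle.dumps` and `pickle.loads`; the `state` dictionary and the `cache` are hypothetical examples, and equal objects are not guaranteed to produce identical bytes in every case:

```python
import pickle

# hypothetical program state to persist
state = {"step": 3, "tokens": ["annual", "report"]}

blob = pickle.dumps(state)      # serialize the object to bytes
restored = pickle.loads(blob)   # rebuild an equal but independent object

# a dict is not hashable, but its pickled bytes are, so they can act as a cache key
cache = {blob: "cached result"}

print(type(blob), restored == state, blob in cache)
```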

In [None]:
import pickle

In [None]:
#pro TIP: explore the functions/properties of an object
def functions(obj):
    return [prop for prop in dir(obj) if not prop.startswith('_')]

In [None]:
a = {1:"6",2:"2",3:"f"}
a

In [None]:
file_name = "test/dict.pickle"

At this point we have imported `pickle`, defined an example dictionary (a Python object) and chosen a file name. Next, we write the dictionary to that file.

In [None]:
# open the file for writing
pickle_out = open(file_name,'wb')

In [None]:
# this writes the object a to the file opened above (file_name)
pickle.dump(a,pickle_out)   

# here we close the file object
pickle_out.close()

In [None]:
pickle_out

In [None]:
# we open the file for reading
pickle_in = open(file_name,'rb')  

In [None]:
# load the object from the file into var b
b = pickle.load(pickle_in)
b

In [None]:
pickle_in

In [None]:
a==b
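
The same round trip can be written with `with` blocks, which close the files automatically even if an error occurs. This is a sketch under the assumption that a temporary directory is acceptable; `tempfile` is used here only so the example does not depend on an existing `test/` folder:

```python
import os
import pickle
import tempfile

a = {1: "6", 2: "2", 3: "f"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dict.pickle")

    # write the object; the file is closed when the block ends
    with open(path, "wb") as pickle_out:
        pickle.dump(a, pickle_out)

    # read the object back from the same file
    with open(path, "rb") as pickle_in:
        b = pickle.load(pickle_in)

print(a == b)
```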

---
#### *Learn more about "Pickle-ing" in Python at [Python Wiki](https://wiki.python.org/moin/UsingPickle)*

<br>
Writing text files

In [None]:
plain_text = text1+' '+text5

In [None]:
plain_text

In [None]:
file_name2 = 'test/text.txt'

In [None]:
# open the file for writing; the with block closes it so the text is flushed to disk
with open(file_name2, 'w') as text_out:
    text_out.write(plain_text)

In [None]:
# open the file for reading; the with block closes it afterwards
with open(file_name2, 'r') as text_in:
    text_read = text_in.read()
text_read
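
The example text contains a non-ASCII apostrophe (`’`), so the encoding used for writing and reading matters. A minimal sketch, assuming UTF-8 is the desired encoding; the shortened `sample` string and the temporary path are illustrative only:

```python
import os
import tempfile

sample = "ABN AMRO’s share price declined 24%"  # contains a non-ASCII apostrophe

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "text.txt")

    # state the encoding explicitly instead of relying on the platform default
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)

    with open(path, "r", encoding="utf-8") as f:
        restored = f.read()

print(restored == sample)

# the same text cannot be represented in plain ASCII
try:
    sample.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```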

---
#### *Learn more about file handling in Python at [Guru99.com](https://www.guru99.com/reading-and-writing-files-in-python.html)*

<br>
Writing tables with pandas

In [None]:
import pandas as pd

In [None]:
#table example in dictionary form
dict_table = {1:[text1], 2:[text5]}

In [None]:
#convert to dataframe
pandas_table = pd.DataFrame(dict_table)

In [None]:
# file name including extension (a new variable, so the text file path above is not overwritten)
file_name3 = 'test/table.csv'

# write to csv
pandas_table.to_csv(file_name3)
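
By default `to_csv` also writes the row index as the first column. The sketch below compares both variants in memory and reads the table back with `pd.read_csv`; the short cell values are hypothetical placeholders for the longer texts:

```python
import io

import pandas as pd

# small stand-in table with integer column names, as in the example above
table = pd.DataFrame({1: ["first text"], 2: ["second text"]})

# without a path, to_csv returns the CSV content as a string
csv_with_index = table.to_csv()
csv_without_index = table.to_csv(index=False)

# read the CSV text back into a dataframe
restored = pd.read_csv(io.StringIO(csv_without_index))

print(csv_with_index)
print(list(restored.columns))
print(restored.iloc[0, 0])
```

Note that the integer column names come back as strings after the round trip through CSV.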

In Jupyter, append a question mark `?` to a name to display the documentation of a Python class or function.

In [None]:
#Example
pd.pivot_table?
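
The `?` syntax only works in IPython and Jupyter. In a plain Python script, a comparable result can be obtained with the built-in `help()` or the standard `inspect` module. A small sketch, using `str.split` purely as an example target:

```python
import inspect

# fetch the cleaned-up docstring of a function or method
doc = inspect.getdoc(str.split)
print(doc.splitlines()[0])

# list the public attributes of an object, similar to tab completion
public_names = [name for name in dir(str) if not name.startswith('_')]
print(public_names[:5])
```

Calling `help(str.split)` prints the same documentation interactively.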

---
#### *Learn more Pandas tricks at [www.dataschool.io](https://www.dataschool.io/python-pandas-tips-and-tricks/)*