# Data Analysis with Python

## Outline
* Classes
* Files
* Scripts
* String Handling and Processing

## Classes
Python has the following built-in classes of objects:

* int
* float
* str
* list
* set
* tuple
* dict

Python is an **object-oriented programming language**.

While this course will not present OOP in depth, it is necessary to understand the basics of objects and classes.

A **class** is a template.  It consists of a set of variables and functions.  An **object** is an instance of a class.  Class/object variables are referred to as **attributes** while class/object functions are referred to as **methods**.  Attributes and methods may belong to the object or the class.

Consider the following toy example:

**Example 1**

In [1]:
#PYTHON3
class Animal():
    '''This is the documentation'''
    
    animal_list=[]
    
    def print_animals():
        for elem in Animal.animal_list:
            print(elem)
    
    def __init__(self,a_species,a_sound):#constructor
        Animal.animal_list.append(a_species)
        self.species=a_species
        self.sound=a_sound
        
    def say(self):
        print("A",self.species,"says",self.sound)      

In [34]:
# THIS WORKS IN PYTHON2 AND PYTHON3
#NOTE THE CLASS METHOD REQUIRES:
###   1) the argument "cls".  this stands for class
###   2) the "@classmethod" decorator.  decorators are an advanced topic


class Animal():
    '''This is the documentation'''
    
    animal_list=[]
    
    @classmethod
    def print_animals(cls):
        for elem in Animal.animal_list:
            print(elem)
    
    def __init__(self,a_species,a_sound):#constructor
        Animal.animal_list.append(a_species)
        self.species=a_species
        self.sound=a_sound
        
    def say(self):
        print("A",self.species,"says",self.sound)  

* `Animal` is the **class**
* `animal_list` is a **class attribute**
* `print_animals` is a **class method**
* `species` and `sound` are each an **object attribute**
* `say` is an **object method**

In [2]:
tigger=Animal("tiger","roar")#using the constructor
teddy=Animal("bear","grrr")

* `tigger` and `teddy` are both **objects**.  They are instances of the `Animal` **class**.
* Calling the class, as in `Animal("tiger","roar")`, runs the **constructor** `__init__` and returns a new object instance.

In [3]:
Animal.animal_list

['tiger', 'bear']

In [4]:
Animal.print_animals() # in Python 2 this fails without cls and the @classmethod decorator

tiger
bear


In [5]:
tigger.print_animals()

TypeError: print_animals() takes 0 positional arguments but 1 was given
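
The `TypeError` above comes from the first definition of `Animal` (cell `In [1]`), where `print_animals` takes no parameters: calling it through an instance makes Python pass that instance as the first argument. A minimal sketch of the difference follows. The class `Zoo` and its method names are hypothetical stand-ins, not part of the lecture code.

```python
class Zoo:
    names = []                      # class attribute shared by all instances

    def plain():                    # no parameters: only callable as Zoo.plain()
        return len(Zoo.names)

    @classmethod
    def counted(cls):               # cls is filled in automatically
        return len(cls.names)

    def __init__(self, name):
        Zoo.names.append(name)

z = Zoo("tiger")
print(Zoo.plain())      # called through the class: no argument is passed
print(Zoo.counted())    # works through the class ...
print(z.counted())      # ... and through an instance

try:
    z.plain()           # the instance is passed as an unexpected argument
except TypeError as err:
    print("TypeError:", err)
```

With the `@classmethod` version of `print_animals` (cell `In [34]`), both `Animal.print_animals()` and `tigger.print_animals()` would be expected to work.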

In [6]:
teddy.species

'bear'

In [7]:
tigger.say()

A tiger says roar


In [8]:
print(type(tigger))

<class '__main__.Animal'>


In [9]:
isinstance(tigger,Animal)

True

In [10]:
isinstance(tigger,int)

False

In [11]:
tigger.__doc__

'This is the documentation'

In [12]:
tigger

<__main__.Animal at 0x10fb6b588>

In [13]:
tigger2=tigger
tigger2.sound='purrrrr'
tigger.say()

A tiger says purrrrr
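
`tigger2=tigger` does not create a second animal: both names refer to the same object, which is why changing `tigger2.sound` changes what `tigger` says. A small sketch of the idea, using a hypothetical stand-in class `Critter` and the standard `copy` module to make an independent copy:

```python
import copy

class Critter:
    def __init__(self, sound):
        self.sound = sound

a = Critter("roar")
b = a                   # alias: a second name for the same object
c = copy.copy(a)        # shallow copy: a new, separate object

b.sound = "purr"
print(a is b, a.sound)  # True purr
print(a is c, c.sound)  # False roar
```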


**Exercise 1**

Rewrite the Animal class to include a method named `intro` which takes another Animal object as its argument.  This method will print a salutation in which the animal introduces itself by species and addresses the other animal by its species.  If the animals are the same species, the salutation should also include the sound.

```
def intro(self,other):
    ...

bear1.intro(bear2) 
----> "Hey bear, I'm a bear too.  Dude...grrr."

bear1.intro(cat1) 
----> "Whazzup cat, I'm a bear."
```


In [52]:
#Your code here
class Animal():
    '''This is the documentation'''
    
    animal_list=[]
    
    @classmethod
    def print_animals(cls):
        for elem in Animal.animal_list:
            print(elem)
    
    def __init__(self,a_species,a_sound):#constructor
        Animal.animal_list.append(a_species)
        self.species=a_species
        self.sound=a_sound
        
    def say(self):
        print("A",self.species,"says",self.sound)  
        
    def intro(self, other):
        if self.species == other.species:
            return("Hey, "+other.species+", I'm a "+self.species+" too. Dude..."+self.sound)
        else:
            return("Whazzup %s, I'm a %s" %(other.species, self.species))
        


In [53]:
tigger=Animal("tiger","roar")#using the constructor
teddy=Animal("bear","grrr")
Animal.animal_list

['tiger', 'bear']

In [54]:
tigger.intro(tigger)

"Hey, tiger, I'm a tiger too. Dude...roar"

In [57]:
print(tigger.intro(teddy))
print(teddy.intro(teddy))

Whazzup bear, I'm a tiger
Hey, bear, I'm a bear too. Dude...grrr


**Example 2**

Create an instance of a list:

In [58]:
li = [1, 2, 3, 4]

Checking its type:

In [59]:
type(li)

list

In [60]:
isinstance(li,list)

True

Checking its attributes

In [61]:
print(li.__doc__)

list() -> new empty list
list(iterable) -> new list initialized from iterable's items


Check its behavior when passed to the function `len()`

In [63]:
print(li.__len__())
print(len(li))

4
4


Check the number of times the value 1 appears in the list

In [64]:
print (li.count(1) ) 

1


In [65]:
li.append(3)# object method
li

[1, 2, 3, 4, 3]

In [66]:
list.append(li,5) # the same method, called through the class
li

[1, 2, 3, 4, 3, 5]

In [69]:
help(str.find)

Help on method_descriptor:

find(...)
    S.find(sub[, start[, end]]) -> int
    
    Return the lowest index in S where substring sub is found,
    such that sub is contained within S[start:end].  Optional
    arguments start and end are interpreted as in slice notation.
    
    Return -1 on failure.



**Example 3**

### Inheritance

* Classes can inherit attributes and methods from other classes. In the example below, `Pet` **inherits** from the `Animal` class.
* `Pet` is said to be a subclass of `Animal`, `Animal` is a superclass of `Pet`.


The class declaration for a subclass uses the superclass as an argument while the constructor for the subclass passes parameters to the superclass constructor, as follows: 

In [70]:
class Pet(Animal): 
    
    def __init__(self, species, sound, name):
        #super().__init__(species,sound)
        Animal.__init__(self,species,sound) 
        self.name = name
              

In [71]:
pup=Pet('canine','woof','Spot')

In [72]:
pup.say()

A canine says woof


In [73]:
pup.name

'Spot'
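
The commented-out line in the constructor shows the other common spelling: in Python 3, `super().__init__(...)` calls the superclass constructor without naming `Animal` explicitly. A minimal sketch with pared-down stand-ins (`BaseAnimal` and `NamedPet` are hypothetical names, used so that the block runs on its own):

```python
class BaseAnimal:
    def __init__(self, species, sound):
        self.species = species
        self.sound = sound

    def say(self):
        return "A %s says %s" % (self.species, self.sound)

class NamedPet(BaseAnimal):
    def __init__(self, species, sound, name):
        super().__init__(species, sound)   # same effect as BaseAnimal.__init__(self, species, sound)
        self.name = name

pup = NamedPet("canine", "woof", "Spot")
print(pup.say())    # A canine says woof
print(pup.name)     # Spot
```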

The subclass can override inherited attributes and methods of the superclass.

In [74]:
class Pet(Animal): 
    
    def __init__(self, species, sound, name):
        #super().__init__(species,sound)
        Animal.__init__(self,species,sound)
        self.name = name
        
    def say(self):
        print(self.name,"the",self.species,"says",self.sound)

In [75]:
kit=Pet('cat','meow','Tom')
kit.say() #use object method

Tom the cat says meow


In [76]:
Pet.say(kit) #use the associated subclass method

Tom the cat says meow


In [77]:
Animal.say(kit) #use the associated superclass method

A cat says meow


`kit` is an instance of the subclass `Pet` and of the superclass `Animal`.

In [78]:
isinstance(kit,Animal)

True

In [79]:
isinstance(kit,Pet)

True

In [80]:
isinstance(tigger,Pet)

False

Consider what happens when the object itself is printed, as follows:

In [81]:
print(tigger)

<__main__.Animal object at 0x1112002b0>


In [82]:
tigger

<__main__.Animal at 0x1112002b0>

In [83]:
str(tigger)

'<__main__.Animal object at 0x1112002b0>'

The behavior of an object in the `str` or `print` function can be changed using the `__str__` method.

In [84]:
class Animal():
    '''This is the documentation'''
    
    animal_list=[]
    @classmethod
    def print_animals(cls):
        for elem in Animal.animal_list:
            print(elem)
    
    def __init__(self,a_species,a_sound):
        Animal.animal_list.append(a_species)
        self.species=a_species
        self.sound=a_sound
        
    def say(self):
        print("A",self.species,"says",self.sound)  
        
    def __str__(self):
        return "This %s is one of %i animals." %(self.species,len(Animal.animal_list))

In [85]:
tigger=Animal("tiger","roar")
teddy=Animal("bear","grrr")
print(tigger)

This tiger is one of 2 animals.


In [86]:
tigger

<__main__.Animal at 0x11122a278>

In [87]:
str(tigger)

'This tiger is one of 2 animals.'
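
Note that typing `tigger` alone at the prompt still shows the default form: the notebook displays a value using `__repr__`, not `__str__`. A minimal sketch, assuming one wants both customised (`Beast` is a hypothetical stand-in class, not the lecture's `Animal`):

```python
class Beast:
    def __init__(self, species):
        self.species = species

    def __str__(self):              # used by print() and str()
        return "This is a %s." % self.species

    def __repr__(self):             # used by repr() and the interactive prompt
        return "Beast(%r)" % self.species

b = Beast("tiger")
print(str(b))     # This is a tiger.
print(repr(b))    # Beast('tiger')
```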

In [88]:
class Pet(Animal): 
    
    def __init__(self, species, sound, name):
        Animal.__init__(self,species,sound)
        self.name = name
        
    def say(self):
        print(self.name,"the",self.species,"says",self.sound)

In [90]:
kit=Pet('cat','meow','Tom')
print(kit)
str(kit)

This cat is one of 4 animals.


'This cat is one of 4 animals.'

**Exercise 2**

* Create a `Dog` class which is a subclass of `Pet`
* Make it have a `species` value "canine" and sound "woof"
* Give it an additional object attribute `age`
* Give it an additional object method `dog_age` which returns its age times seven
* Override the `__str__` method so it returns a string stating the name and that it is an old dog if it is over 10, or a puppy if it is under 3.
* Create an instance of `Dog` and test its methods.


In [112]:
# your code here     
class Dog(Pet):
    def __init__(self, name, age):
        Pet.__init__(self, "canine", "woof", name)
        self.age = age
        
    def dog_age(self):
        return self.age*7
    
    def __str__(self):
        if self.age > 10:
            return "My name is %s, I'm an old dog" %self.name
        elif self.age < 3:
            return "My name is %s, I'm a puppy" %self.name
        else:
            return "My name is %s" %self.name

            

In [113]:
dog1=Dog("Bea",15)
dog2=Dog("Pika",7)
dog3=Dog("Fido",1)

In [114]:
print(dog1)
print(dog2)
print(dog3)

My name is Bea, I'm an old dog
My name is Pika
My name is Fido, I'm a puppy


In [115]:
print(dog1.dog_age())
print(dog2.dog_age())
print(dog3.dog_age())

105
49
7


In [117]:
dog1.nickname = "B" #creating a new attribute -only for that instance-
dog1.nickname

'B'

In [124]:
Dog.__bases__

(__main__.Pet,)

## String Class and Objects
### Creating strings

Here are examples of strings.  Note the following:

* Use of single, double and triple quotation marks.
* Use of \n newline character
* Use of \ escape character
* Use of the `r` prefix, indicating "raw" characters

In [125]:
type('hi')

str

In [31]:
print ("I'm Jack.")
print ('Jack says "My name is Jack".')
print ("Jack says \"Hi, I'm  Jack\"")
print ('Jack says "Hi, I\'m  Jack"')
print ('''Jack says "Hi, I'm  Jack"''')
print ("""Jack says 'Hi, I'm  Jack'""")
print ('''Jack says:
    "Hi, I'm  Jack"''')
print ('Hi, I do not want a \new line.')
print (r'Hi, I do not want a \n new line.')

I'm Jack.
Jack says "My name is Jack".
Jack says "Hi, I'm  Jack"
Jack says "Hi, I'm  Jack"
Jack says "Hi, I'm  Jack"
Jack says 'Hi, I'm  Jack'
Jack says:
    "Hi, I'm  Jack"
Hi, I do not want a 
ew line.
Hi, I do not want a \n new line.


## Basic String Manipulations

Case conversion is done as follows:

In [127]:
'ABcd'.lower() #object method

'abcd'

In [128]:
str.lower('ABcd') # same method, called through the class

'abcd'

In [129]:
'ABcd'.upper() # object method

'ABCD'

In [130]:
str.upper('ABcd') # same method, called through the class

'ABCD'

In [131]:
'ABcd'.swapcase() #object method

'abCD'

In [132]:
str.swapcase('ABcd') # same method, called through the class

'abCD'

In [133]:
'acD acd'.title() #object method

'Acd Acd'

In [134]:
str.title('acD acd') # same method, called through the class

'Acd Acd'

Additionally, string objects have the following object methods:

* split
* replace
* count
* join
* strip

In [135]:
'a b c d'.split(' ') # split by ' '

['a', 'b', 'c', 'd']

In [190]:
'a b c d'.split() # default: split on any run of whitespace

['a', 'b', 'c', 'd']

In [137]:
'a b c d'.replace(' ', '>') # replace ' ' with '>'

'a>b>c>d'

In [138]:
'a b c d'.count(' ') # count how many times ' ' appears

3

In [139]:
' '.join(['a', 'b', 'c'])

'a b c'

In [140]:
''.join(['a', 'b', 'c'])

'abc'

In [141]:
'/'.join(['a', 'b', 'c'])

'a/b/c'

In [142]:
str.join('X',['a', 'b', 'c'])

'aXbXc'

In [143]:
'X'.join(['a', 'b', 'c'])

'aXbXc'

In [144]:
'\n hey hey   yo .  '.strip()

'hey hey   yo .'

In [147]:
help(''.strip)

Help on built-in function strip:

strip(...) method of builtins.str instance
    S.strip([chars]) -> str
    
    Return a copy of the string S with leading and trailing
    whitespace removed.
    If chars is given and not None, remove characters in chars instead.
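
As the help text says, `strip` also accepts a set of characters to remove. A short sketch of that form, together with the one-sided variants `lstrip` and `rstrip` (the sample string is made up for illustration):

```python
s = "--==hello world==--"

print(s.strip("-="))         # hello world
print(s.lstrip("-="))        # hello world==--
print(s.rstrip("-="))        # --==hello world
print("  padded\n".strip())  # padded
```

The argument is treated as a set of characters, not as a prefix or suffix, so the order of `-` and `=` does not matter.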



The string class can be extended as follows.  Because `mystr` inherits its constructor from the built-in `str` class, no `__init__` method needs to be written.

In [148]:
class mystr(str):
            
    def funk(self):
        return "funky "+ self.upper() +"!"
        

The `funk` method is added to the set of `mystr` methods.  All of the `str` methods are also available.

In [149]:
s=mystr('oompa DOOMpAh')
print(s.lower())
print(s.funk())
print(mystr.lower(s))
print(str.lower(s))
print(mystr.funk(s))

oompa doompah
funky OOMPA DOOMPAH!
oompa doompah
oompa doompah
funky OOMPA DOOMPAH!


**Exercise 3**
Rewrite the `mystr` class to include the following methods.  

* `replace_all()`  This method should function like `replace` except take a list of strings and list of replacements as parameters.

```
mystr.replace_all('r&d vs r&b',['&','vs'],[' and ','versus'])
----> 'r and d versus r and b'

```

* `split_all()`  This method should function like `split` except take a list of separators as a parameter.

```
mystr.split_all('r&d vs r&b',['&',' vs '])
----> ['r','d','r','b']

```

In [195]:
# Your code here

class mystr(str):
    
    def replace_all(self, char_to_replace, replacements):
        for i in range(len(char_to_replace)):
            self = self.replace(char_to_replace[i], replacements[i%len(replacements)])
        return self
    
    def split_all(self, char_to_split):
        self = self.replace_all(char_to_split,char_to_split[0])
        return self.split(char_to_split[0])    
        


In [196]:
S=mystr('r&d vs r&b')
print(S.replace_all(['&','vs'],[' and ']) )
print(S)
print(S.split_all(['&',' vs ']) )
print(S)

r and d  and  r and b
r&d vs r&b
['r', 'd', 'r', 'b']
r&d vs r&b


## Files

Working with files is fairly straightforward in Python.

In [197]:
f = open('fooz.txt', 'w') 


In [198]:
print (type(f))


<class '_io.TextIOWrapper'>


In [199]:
print(f.__doc__)

Character and line based layer over a BufferedIOBase object, buffer.

encoding gives the name of the encoding that the stream will be
decoded or encoded with. It defaults to locale.getpreferredencoding(False).

errors determines the strictness of encoding and decoding (see
help(codecs.Codec) or the documentation for codecs.register) and
defaults to "strict".

newline controls how line endings are handled. It can be None, '',
'\n', '\r', and '\r\n'.  It works as follows:

* On input, if newline is None, universal newlines mode is
  enabled. Lines in the input can end in '\n', '\r', or '\r\n', and
  these are translated into '\n' before being returned to the
  caller. If it is '', universal newline mode is enabled, but line
  endings are returned to the caller untranslated. If it has any of
  the other legal values, input lines are only terminated by the given
  string, and the line ending is returned to the caller untranslated.

* On output, if newline is None, any '\n' characters writt

In [200]:
!ls

PDA_lec2-exercise.ipynb      lec1exercise_solutions.ipynb
PDA_lec2.ipynb               oldmanandthesea.txt
fooz.txt                     record.txt


When opening a file, the mode can be:

* 'r': reading (default)
* 'w': writing
* 'a': appending
* 'r+': opens the file for both reading and writing

In this case the mode is 'w', so we either overwrite the file "fooz.txt" or create a new file "fooz.txt".

## Writing Files

In [201]:
f.write('this is some text.\nThis is another line.\n')
f.write('oh snap, more lines.\nIt\'s a line.\n')
f.close()

**Do not forget to close the file after using it.** Closing flushes any buffered writes to disk and releases the file; until then, other applications may not see the complete contents.
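
One way to make sure a file is always closed is the `with` statement, which closes the file when the indented block ends, even if an error occurs inside it. A minimal sketch, assuming a scratch file in the system temporary directory (the file name `fooz_demo.txt` is hypothetical):

```python
import os
import tempfile

# hypothetical scratch file, kept out of the working directory
path = os.path.join(tempfile.gettempdir(), "fooz_demo.txt")

with open(path, "w") as f:
    f.write("this is some text.\n")
    f.write("This is another line.\n")

print(f.closed)            # True: the file was closed on leaving the block

with open(path, "r") as f:
    text = f.read()
print(text)
```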

Print out the contents by using a shell command, as follows:

In [202]:
!cat fooz.txt  # more instead of cat for a pc

this is some text.
This is another line.
oh snap, more lines.
It's a line.


## Reading Files

The `read` method returns the file contents as a string.

In [203]:
f = open('fooz.txt', 'r') 
f.read()

"this is some text.\nThis is another line.\noh snap, more lines.\nIt's a line.\n"

Running `f.read` a second time returns an empty string because the file position is already at the end of the file.

In [204]:
f.read()

''

In [205]:
f.close()

In [207]:
f.read() # ValueError: I/O operation on closed file

ValueError: I/O operation on closed file.
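
A file that is still open can be read again by moving the file position back to the start with `seek(0)`. A small sketch, assuming a hypothetical scratch file in the temporary directory:

```python
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "fooz_seek_demo.txt")  # hypothetical name

with open(path, "w") as f:
    f.write("abc\ndef\n")

with open(path, "r") as f:
    first = f.read()     # reads everything; the position is now at the end
    second = f.read()    # nothing left to read
    f.seek(0)            # move the position back to the start
    third = f.read()     # the whole content again

print(repr(first), repr(second), repr(third))
```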

## Reading part of the file

In [208]:
f = open('fooz.txt', 'r')

In [209]:
f.read(5) # the first 5 characters

'this '

In [210]:
f.read(5) # the next 5 characters

'is so'

In [211]:
f.read() # the remaining contents

"me text.\nThis is another line.\noh snap, more lines.\nIt's a line.\n"

In [212]:
f.close()
help(f.read)

Help on built-in function read:

read(size=-1, /) method of _io.TextIOWrapper instance
    Read at most n characters from stream.
    
    Read from underlying buffer until we have n characters or we hit EOF.
    If n is negative or omitted, read until EOF.



## Reading Files
Another way to read files is using the method **readlines**.  This method returns a list in which each element refers to one line.

In [213]:
f = open('fooz.txt', 'r')
lines=f.readlines()
print(lines)
f.close()

['this is some text.\n', 'This is another line.\n', 'oh snap, more lines.\n', "It's a line.\n"]


In [214]:
help(f.readline)

Help on built-in function readline:

readline(size=-1, /) method of _io.TextIOWrapper instance
    Read until newline or EOF.
    
    Returns an empty string if EOF is hit immediately.



## Iterating a file object
The file object is iterable.

In [215]:
lines

['this is some text.\n',
 'This is another line.\n',
 'oh snap, more lines.\n',
 "It's a line.\n"]

In [216]:
f = open('fooz.txt', 'r')
for i in f:
    print(i)
f.close()

this is some text.

This is another line.

oh snap, more lines.

It's a line.
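
The output above is double-spaced because each line read from the file keeps its trailing `\n` and `print` adds another. Two common ways to avoid this are sketched below; `io.StringIO` is used as a stand-in for an open text file so the block needs no file on disk.

```python
import io

# io.StringIO stands in for an open text file, so no file on disk is needed
f = io.StringIO("this is some text.\nThis is another line.\n")

cleaned = []
for line in f:
    cleaned.append(line.rstrip("\n"))   # drop the newline read from the file
    print(cleaned[-1])

f.seek(0)                               # rewind and read again
for line in f:
    print(line, end="")                 # or stop print from adding its own newline
```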



Overwrite the file "fooz.txt" by using the following shell command.

In [217]:
!echo "One line.\nTwo line.\nThird line.\nFourth line." > fooz.txt #use >> to just add lines at the end

In [222]:
!cat fooz.txt

One line.
Two line.
Third line.
Fourth line.


In [225]:
print(enumerate(open('fooz.txt', 'r')))

<enumerate object at 0x111271438>


In [220]:
# this copies every other line (the 1st, 3rd, ...) to a new file
f1 = open('fooz.txt', 'r')
f2 = open('fooz2.txt', 'w')
for i, elem in enumerate(f1):
    if i%2==0:
        f2.write(elem)    
f1.close()
f2.close()
! cat fooz2.txt

One line.
Third line.


**Exercise 4**

Write a function named `text_cleaner()` that takes three parameters: a file name to read, a filename to write and a list of punctuation marks to remove.  The function should read a file and write the contents to another file.  The text should have all given punctuation marks removed and all letters in lowercase.


In [234]:
#Your code here

def text_cleaner(file1, file2, punctuation):
    f1 = open(file1, 'r')
    f2 = open(file2, 'w')
    for elem in f1:
        elem = elem.lower()
        # replace every punctuation mark with the empty string
        elem = mystr(elem).replace_all(punctuation, [''])
        f2.write(elem)
    # close both files inside the function, once the loop has finished
    f1.close()
    f2.close()

In [235]:
!cat fooz.txt

One line.
Two line.
Third line.
Fourth line.


In [236]:
text_cleaner('fooz.txt','clean.txt', ['.', '#'])

In [237]:
!cat clean.txt

one line
two line
third line
fourth line
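
As an alternative to the `mystr.replace_all` approach used above, the built-in `str.maketrans` and `str.translate` can delete a set of characters in one pass. This is a sketch under the assumption that every punctuation mark is a single character; `clean_text` is a hypothetical helper name.

```python
def clean_text(text, punctuation):
    """Lowercase text and delete every character listed in punctuation."""
    # the third argument of str.maketrans lists the characters to delete
    table = str.maketrans("", "", "".join(punctuation))
    return text.lower().translate(table)

print(clean_text("One line.\nTwo line.#\n", [".", "#"]))
```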


## Preliminary Topics 

### Sort a dictionary

A dictionary has no sort method of its own, so the first step is to get its (key, value) pairs with the `.items()` method.

Then the function **sorted** can be used to return those pairs as a sorted list.

In [238]:
d = {'a': 5, 'b': 3, 'c':2}
d = d.items() # a view of the (key, value) pairs; list(d) converts it to a list
print(d)
print(list(d))

dict_items([('a', 5), ('b', 3), ('c', 2)])
[('a', 5), ('b', 3), ('c', 2)]


In [239]:
sorted(d)

[('a', 5), ('b', 3), ('c', 2)]

In [240]:
sorted(d, key = lambda x: x[1])

[('c', 2), ('b', 3), ('a', 5)]

* The argument `key` in `sorted` must be a callable (a function) that takes one element and returns the value to sort by.
* Here `lambda x: x[1]` is an anonymous function, equivalent to a one-line function defined with the keyword `def`.
* The whole command sorts the list of pairs by the second element of each pair, i.e. by the dictionary value.
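
The points above can be sketched as follows. `operator.itemgetter(1)` from the standard library is shown as an equivalent to the lambda, and `reverse=True` gives descending order; the variable names are chosen for illustration.

```python
from operator import itemgetter

d = {'a': 5, 'b': 3, 'c': 2}

# sort the (key, value) pairs by value, ascending
by_value = sorted(d.items(), key=lambda pair: pair[1])

# the same key function via itemgetter, in descending order
by_value_desc = sorted(d.items(), key=itemgetter(1), reverse=True)

print(by_value)        # [('c', 2), ('b', 3), ('a', 5)]
print(by_value_desc)   # [('a', 5), ('b', 3), ('c', 2)]
```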


**Exercise 5**

Write a function named `wordCounter()` that takes a file name as an argument.  The function should:

* open the file
* strip every line
* remove punctuation marks
* convert all words to lowercase
* create a word frequency dictionary
* return a sorted list version of the dictionary



In [304]:
#your code here
import string  # needed for string.punctuation below
def countWord(str):
    str = str.lower()
    words_frequencies = {}
    current_word = ''
    separators = string.punctuation + ' '
    i=0
    # get to the first word
    while i < len(str):
        while i < len(str) and str[i] in separators:
            i+=1
        while i < len(str) and str[i] not in separators:
            current_word = current_word+str[i]
            i+=1
        if current_word != '': #needed for the last loop
            try:
                words_frequencies[current_word]+=1
            except KeyError:
                words_frequencies[current_word]=1
        current_word = ''
    return words_frequencies

def wordCounter(file_name):
    dictio = {}
    f = open(file_name, 'r',encoding='utf-8', errors='ignore')
    for line in f.readlines():
        line = line.strip()
        line = mystr.replace_all(line, ".,?!\';:", [' '])
        line = line.lower()
        line = mystr.split_all(mystr(line), [' '])
        for word in line:
            if word == '':
                continue  # skip empty strings produced by consecutive separators
            if word in dictio:
                dictio[word]+=1
            else:
                dictio[word]=1
    f.close()
    dictio = dictio.items()
    return sorted(list(dictio), key = lambda x: x[1], reverse = True)

wordCounter('fooz.txt') 

[('line', 4), ('one', 1), ('two', 1), ('third', 1), ('fourth', 1)]

In [305]:
wordCounter('oldmanandthesea.txt')

[('the', 2330),
 ('and', 1260),
 ('he', 1137),
 ('of', 547),
 ('it', 479),
 ('to', 458),
 ('his', 446),
 ('"', 441),
 ('was', 436),
 ('i', 406),
 ('a', 397),
 ('in', 365),
 ('that', 296),
 ('fish', 280),
 ('man', 268),
 ('old', 252),
 ('him', 230),
 ('but', 215),
 ('with', 205),
 ('not', 203),
 ('had', 203),
 ('on', 203),
 ('as', 197),
 ('is', 193),
 ('said', 189),
 ('you', 171),
 ('thought', 165),
 ('now', 162),
 ('then', 146),
 ('they', 144),
 ('have', 143),
 ('for', 139),
 ('were', 138),
 ('line', 138),
 ('could', 116),
 ('water', 105),
 ('out', 104),
 ('when', 103),
 ('"i', 102),
 ('at', 100),
 ('boy', 100),
 ('them', 97),
 ('there', 96),
 ('all', 96),
 ('would', 95),
 ('from', 94),
 ('will', 90),
 ('hand', 90),
 ('down', 89),
 ('do', 88),
 ('be', 87),
 ('back', 85),
 ('up', 83),
 ('so', 83),
 ('s', 82),
 ('no', 77),
 ('can', 77),
 ('if', 75),
 ('one', 73),
 ('must', 72),
 ('too', 70),
 ('see', 67),
 ('again', 64),
 ('great', 64),
 ('are', 62),
 ('head', 62),
 ('what', 61),
 ('more

In [None]:
!cat 'fooz.txt'

In [281]:
help(sorted)

Help on built-in function sorted in module builtins:

sorted(iterable, /, *, key=None, reverse=False)
    Return a new list containing all items from the iterable in ascending order.
    
    A custom key function can be supplied to customize the sort order, and the
    reverse flag can be set to request the result in descending order.
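
The counting and sorting steps of Exercise 5 can also be sketched with `collections.Counter` from the standard library. This is an alternative to the dictionary-based solution above, not the same method; `count_words` is a hypothetical name, and removing everything in `string.punctuation` is an assumption about what counts as punctuation.

```python
import string
from collections import Counter

def count_words(text):
    """Return (word, count) pairs, most frequent first."""
    table = str.maketrans("", "", string.punctuation)   # delete punctuation
    words = text.lower().translate(table).split()       # split on whitespace
    return Counter(words).most_common()

print(count_words("One line.\nTwo line.\nThird line.\nFourth line."))
```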



## Script
### Running Python scripts
Python is a scripting language, which means a script can be run as a shell command.
Here is a simple example: write the line `print ('1 + 1 = %s' %(1+1))` into the file "script/script1.py":

In [277]:
!mkdir script

In [278]:
!echo "print ('1 + 1 = %s' %(1+1))" > script/script1.py

In [279]:
!cat script/script1.py

print ('1 + 1 = %s' %(1+1))


In [280]:
!python script/script1.py

1 + 1 = 2


## Run Python scripts
In practice, we usually need to interact with scripts.

- Take in some inputs.
- Run the script.
- Return some outputs.


In [291]:
open('script/script.py', 'w').write('''from sys import argv
script, name, age = argv
print ("The script is called:", script)
print ("My name is %s." %name)
print ("Im %s years old." %age)''')

148

In [292]:
!cat script/script.py

from sys import argv
script, name, age = argv
print ("The script is called:", script)
print ("My name is %s." %name)
print ("Im %s years old." %age)

In [293]:
!python script/script.py jack 18

('The script is called:', 'script/script.py')
My name is jack.
Im 18 years old.

Note that the first output line is printed as a tuple. This indicates that the `python` command here runs Python 2, where `print ("a", b)` prints a tuple; under Python 3 the line would read `The script is called: script/script.py`.


In [294]:
!python script/script.py Lucy 20

('The script is called:', 'script/script.py')
My name is Lucy.
Im 20 years old.
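
The line `script, name, age = argv` raises a `ValueError` when the script is not given exactly two arguments. A minimal sketch of a guard is shown below; `parse_args` is a hypothetical helper, and it takes the argument list as a parameter so it can be tried without running a script.

```python
def parse_args(argv):
    """Return (name, age) from an argv-style list, or None if the count is wrong."""
    if len(argv) != 3:
        print("usage: python script.py NAME AGE")
        return None
    script, name, age = argv
    return name, int(age)           # command-line arguments arrive as strings

# in a real script this would be called as parse_args(sys.argv)
print(parse_args(["script/script.py", "jack", "18"]))   # ('jack', 18)
print(parse_args(["script/script.py"]))                 # prints usage, then None
```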


In [295]:
%%!
cd script
ls

['script.py', 'script1.py']

**Exercise 6**

Write a script named `wordCounter.py` which takes the name of a file and an integer N and prints out the top N words and their frequencies.

```
from sys import argv
script, target, number = argv

...
```

In [300]:
# write trial code here then create a file in the script folder and run it



-f


FileNotFoundError: [Errno 2] No such file or directory: '-f'

In [347]:
!python wordCounter.py oldmanandthesea.txt 20

('the', 2330)
('and', 1260)
('he', 1137)
('of', 547)
('it', 479)
('to', 458)
('his', 446)
('"', 441)
('was', 436)
('i', 406)
('a', 397)
('in', 365)
('that', 296)
('fish', 280)
('man', 268)
('old', 252)
('him', 230)
('but', 215)
('with', 205)
('had', 203)


In [350]:
!python wordCounter.py les_miserables.txt 100

('the', 40375)
('of', 19899)
('and', 14737)
('a', 14287)
('to', 13617)
('in', 11091)
('he', 9484)
('was', 8607)
('that', 7657)
('"', 6629)
('his', 6434)
('it', 6392)
('had', 6171)
('is', 6115)
('which', 5126)
('with', 4512)
('on', 4443)
('at', 4025)
('this', 3906)
('not', 3772)
('you', 3396)
('one', 3286)
('as', 3231)
('i', 3104)
('for', 2934)
('him', 2932)
('have', 2747)
('her', 2626)
('there', 2552)
('who', 2483)
('from', 2436)
('all', 2431)
('she', 2390)
('be', 2365)
('by', 2359)
('are', 2123)
('an', 2097)
('they', 2072)
('s', 2064)
('man', 1963)
('but', 1945)
('no', 1859)
('were', 1821)
('said', 1793)
('been', 1516)
('what', 1493)
('when', 1341)
('marius', 1331)
('their', 1246)
('we', 1232)
('will', 1207)
('two', 1172)
('jean', 1166)
('me', 1144)
('so', 1123)
('more', 1121)
('valjean', 1118)
('himself', 1083)
('my', 1075)
('has', 1070)
('them', 1066)
('would', 1044)
('then', 1010)
('--', 1000)
('these', 988)
('into', 

## Regular Expressions
Basic and intermediate functions for working with strings have been covered.

To fully unleash the power of string manipulation, it's necessary to learn regular expressions.

### Concept
A regular expression is a special text string for describing a set of strings. This "special string" is formally called a **pattern**. Hence, a regular expression is a pattern that describes a set of strings.

The goal of using a regular expression is to find or extract specific text by describing its pattern.

### Pattern
For example, both **gray** and **grey** match the pattern **gr.y** in which the dot . refers to an arbitrary character.
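
A quick check of this pattern with Python's `re` module (introduced in the `re` section of these notes). The extra test words are made up to show that the dot accepts any single character, but not zero characters:

```python
import re

# True where the word contains 'gr', any one character, then 'y'
results = {word: re.search("gr.y", word) is not None
           for word in ["gray", "grey", "gr8y", "gry"]}
print(results)   # {'gray': True, 'grey': True, 'gr8y': True, 'gry': False}
```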

<code>
Identifiers:

\d = any digit
\D = anything but a digit
\s = any whitespace character (space, tab, newline)
\S = anything but whitespace
\w = any word character (letter, digit or underscore)
\W = anything but a word character
. = any character, except for a new line
\b = word boundary (the edge of a whole word)
\\. = period. must use backslash, because . normally means any character.

Modifiers:

{1,3} = expect 1 to 3 repetitions of the preceding item
\+ = match 1 or more repetitions
? = match 0 or 1 repetitions
\* = match 0 or more repetitions
$ = matches at the end of a string
^ = matches at the start of a string
| = matches either/or. Example: x|y will match either x or y
[] = a set or range of characters
{x} = expect exactly x repetitions of the preceding item
{x,y} = expect between x and y repetitions of the preceding item

White Space Characters:

\n = new line
\s = any whitespace
\t = tab
\e = escape (not supported by Python's re module)
\f = form feed
\r = carriage return

Characters to REMEMBER TO ESCAPE IF USED!

. + * ? [ ] $ ^ ( ) { } | \

Brackets:

quant[ia]tative = will find either quantitative or quantatative
[a-z] = any single lowercase letter a-z
[1-5a-qA-Z] = any single character that is a digit 1-5, a lowercase letter a-q or an uppercase letter A-Z
</code>
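
A few of the identifiers above, tried out with `re.findall`, which returns every non-overlapping match as a list. The sample text is made up for illustration, and the patterns are written as raw strings (`r'...'`) so that Python leaves the backslashes for `re` to interpret.

```python
import re

text = "Room 42 costs $7.50"

print(re.findall(r"\d+", text))      # ['42', '7', '50']
print(re.findall(r"\w+", text))      # ['Room', '42', 'costs', '7', '50']
print(re.findall(r"\s", text))       # [' ', ' ', ' ']
print(re.findall(r"\bco\w*", text))  # ['costs']
print(re.findall(r"\d\.\d", text))   # ['7.5']
```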

## re
The library **re** provides regular expression support in Python.

In [8]:
import re

In [9]:
help(re.search)

Help on function search in module re:

search(pattern, string, flags=0)
    Scan through string looking for a match to the pattern, returning
    a match object, or None if no match was found.



In [10]:
raw_string = 'Hi, how are you today?'
print (re.search('Hi', raw_string))

<_sre.SRE_Match object; span=(0, 2), match='Hi'>


In [11]:
print (re.search('Hello', raw_string))

None


In [15]:
s = re.search('Hi', raw_string)
s != None

True

In [13]:
s = re.search('Hey', raw_string)
s != None
print(s)

None


In [16]:
print (s.start()) # the starting position of the matched string
print (s.end())   # the ending position index of the matched string
print (s.span())  # a tuple containing the (start, end) positions of the matched string

0
2
(0, 2)


In [None]:
print( s.group() )# the matched string
print( raw_string[s.start():s.end()] )# same
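
Because `re.search` returns `None` when nothing matches, methods such as `.start()` or `.group()` can only be called on a successful match (in the cells above, `s` must come from the `'Hi'` search, not the `'Hey'` one). A minimal sketch of checking first:

```python
import re

raw_string = 'Hi, how are you today?'

for pattern in ['Hi', 'Hey']:
    m = re.search(pattern, raw_string)
    if m is not None:
        print(pattern, '->', m.group(), m.span())
    else:
        print(pattern, '-> no match')
```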

## Using Identifiers

In [329]:
#     . = any character, except for a new line

print (re.search('a.', 'aa') != None)
print (re.search('a.', 'ab') != None)
print (re.search('a.', 'a1') != None)
print (re.search('a.', 'a#') != None)
print (re.search('a.', 'a') != None)
print (re.search('a.', 'a\n') != None)

True
True
True
True
False
False


In [17]:
#      ? = match 0 or 1 repetitions.

print (re.search('ba?b', 'bb') != None )   # match
print (re.search('ba?b', 'bab') != None )  # match
print (re.search('ba?b', 'baab') != None ) # does not match
print (re.search('ba?b', 'abab') != None )
print (re.search('ba??b', 'abab'))

True
True
False
True
<_sre.SRE_Match object; span=(1, 4), match='bab'>
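
The last line above uses `??`, the non-greedy (lazy) form of `?`: it prefers zero repetitions and takes the optional character only when the rest of the pattern requires it. A short sketch of the difference, with sample strings chosen for illustration:

```python
import re

print(re.search('ba?', 'baab').group())     # ba   greedy: takes the optional 'a'
print(re.search('ba??', 'baab').group())    # b    lazy: skips it when it can
print(re.search('ba??b', 'abab').group())   # bab  lazy, but 'a' is needed to reach 'b'
```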


In [331]:
#     + = match 1 or more

print( re.search('ba+b', 'bb') != None  )  # does not match
print( re.search('ba+b', 'bab') != None   )# match
print( re.search('ba+b', 'baab') != None  )# match
print( re.search('ba+b', 'baaaab') != None  )# match
print( re.search('ba+b', 'baaaaaab') != None ) # match

False
True
True
True
True


In [332]:
#     * = match 0 or MORE repetitions

print (re.search('ba*b', 'bb') != None )   # match
print (re.search('ba*b', 'bab') != None )  # match
print (re.search('ba*b', 'baaaaaab') != None ) # match
print (re.search('ba*b', 'baaagb') != None )

True
True
True
False


In [333]:
print( re.search('ba{1,3}b', 'bab') != None    )# match
print( re.search('ba{1,3}b', 'baab') != None  ) # match
print( re.search('ba{1,3}b', 'baaab') != None)  # match
print( re.search('ba{1,3}b', 'bb') != None     )# does not match
print( re.search('ba{1,3}b', 'baaaab') != None) # does not match


True
True
True
False
False


In [334]:
#      ^ = matches start of a string

print( re.search('^a', 'abc') != None )   # match
print( re.search('^a', 'abcde') != None)  # match
print( re.search('^a', ' abcde') != None) # does not match

True
True
False


In [335]:
#     $ = matches at the end of string

print (re.search('a$', 'aba') != None )   # match
print (re.search('a$', 'abcba') != None)  # match
print (re.search('a$', ' aba ') != None ) # does not match

True
True
False


In [336]:
print (re.search('^a.a$', 'aba') != None )   # match
print (re.search('^a.a$', 'a a') != None  )  # match
print (re.search('^a.a$', 'a.a') != None   ) # match

print (re.search('^a.a$', 'bba') != None  )  # does not match
print (re.search('^a.a$', 'abba') != None  ) 
print (re.search('^a.*a$', 'abba') != None  )
print (re.search('^a.?a$', 'abba') != None  )
print (re.search('^a.{1,3}a$', 'abba') != None  )#match

True
True
True
False
False
True
False
True


### brackets

`[]` is used for specifying a set of characters to match. For example, `[123abc]` will match any one of the characters 1, 2, 3, a, b, or c; this is the same as `[1-3a-c]`, which uses ranges to express the same set of characters. Furthermore, `[a-z]` matches any lowercase letter, while `[0-9]` matches any digit.

In [337]:
print (re.search('[123abc]', 'defg')  != None)   # does not match
print (re.search('[123abc]', '1defg') != None )  # match
print (re.search('[1-3a-c]', '2defg') != None  ) # match
print (re.search('[123abc]', 'adefg') != None )  # match
print (re.search('[1-3a-c]', 'bdefg') != None )  # match

False
True
True
True
True


### parentheses

() works much like parentheses in mathematics: it groups the expressions it contains, and the whole group can then be repeated with a quantifier.
For example, the pattern (abc){2,3} matches abc repeated 2 or 3 times.

In [340]:
print (re.search('(abc){2,3}', 'abc')  != None  )       # does not match
print (re.search('(abc){2,3}', 'abcabc')  != None)      # match
print (re.search('(abc){2,3}', 'abcabcabc')  != None )  # match
print (re.search('(abc){2}', 'abcabcabc')  != None )  # match (the first two repetitions)
print (re.search('^(abc){1,2}$', 'abcabcabc')  != None )  # does not match (three repetitions)

False
True
True
True
False


In [341]:
re.search('(abc){2,3}', 4*'abc')  != None

True

In [342]:
re.search('^(abc){2,3}$', 4*'abc')  != None

False
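To see why the unanchored pattern succeeds on four repetitions while the anchored one fails, it can help to look at the text that was actually matched. The following is a minimal sketch; the variable names are illustrative and not part of the original notebook.

```python
import re

text = 4 * 'abc'   # 'abcabcabcabc'

# Without anchors, the pattern may match any part of the string.
unanchored = re.search('(abc){2,3}', text)
# With ^ and $, the 2 to 3 repetitions must cover the whole string.
anchored = re.search('^(abc){2,3}$', text)

print(unanchored.group())  # abcabcabc (greedy: three of the four repetitions)
print(anchored)            # None (four repetitions do not fit {2,3})
```

The match object's `group()` method returns the matched substring, which is often more informative than a bare `True` or `False`.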

In [343]:
print(re.search('[ab]', 'a') != None)   # match
print(re.search('[ab]', 'b') != None)   # match
print(re.search('[ab]', 'c') != None)  # does not match

True
True
False


### vertical bar
`|` is the alternation (logical OR) operator. For example, `a|b` matches `a` or `b`, which is equivalent to `[ab]`.

`abc|123` matches the sequence `abc` or the sequence `123`, while `[abc123]` matches any single character among `a, b, c, 1, 2, 3`.

In [344]:
print (re.search('abc|123', 'a') != None)   # does not match
print (re.search('abc|123', '1') != None )  # does not match
print (re.search('abc|123', '123') != None) # match
print (re.search('abc|123', 'abc') != None) # match

False
False
True
True
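The difference between alternation and a character class is easiest to see with `re.findall`, which lists every match. This is a small sketch on made-up sample strings (`'cabbage'` and `'abc-123-a1'` are assumptions chosen for illustration).

```python
import re

# Between single characters, | and [] find the same matches.
alt_chars = re.findall('a|b', 'cabbage')
class_chars = re.findall('[ab]', 'cabbage')
print(alt_chars)    # ['a', 'b', 'b', 'a']
print(class_chars)  # ['a', 'b', 'b', 'a']

# Between sequences, | matches whole sequences,
# while the class still matches one character at a time.
alt_seq = re.findall('abc|123', 'abc-123-a1')
class_seq = re.findall('[abc123]', 'abc-123-a1')
print(alt_seq)      # ['abc', '123']
print(class_seq)    # ['a', 'b', 'c', '1', '2', '3', 'a', '1']
```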


### backslash
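To match a literal `?`, escape it with a backslash: `\?`. Writing such patterns as raw strings (for example `r'\?'`) keeps Python from interpreting the backslash itself.

Otherwise, the character `?` is treated as a quantifier: it matches the preceding character (or group) either zero times or once.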

In [345]:
print (re.search(r'\?', 'Hi, how are you today?') != None)  # match: a literal question mark

True


In [346]:
print (re.search('y?', 'Hi, how are you today?') != None)  # always a match: 'y?' may match zero characters

True
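Both searches above return `True`, but for different reasons. The sketch below inspects the match objects to show what each pattern actually matched; the variable names are illustrative assumptions.

```python
import re

text = 'Hi, how are you today?'

# 'y?' means "an optional y", so an empty match at position 0 already succeeds.
unescaped = re.search('y?', text)
# r'\?' means "a literal question mark".
escaped = re.search(r'\?', text)

print(repr(unescaped.group()), unescaped.start())  # '' 0
print(repr(escaped.group()), escaped.start())      # '?' 21
```

An unescaped quantifier can therefore make a search succeed without matching any visible text, which is worth checking when a pattern seems to match everything.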


## Useful functions
* **re.split(pattern, string)**: Split the string into a list at each match of the pattern.
* **re.sub(pattern, replace, string)**: Replace every substring that matches the pattern with the argument replace.
* **re.findall(pattern, string)**: Find all substrings that match the pattern and return them as a list.

Python strings already provide the methods `str.split` and `str.replace`, which do similar work.

`str.split` corresponds to `re.split`, and `str.replace` corresponds to `re.sub`.

The string methods only accept a fixed separator or substring, whereas `re.split` and `re.sub` accept regular expressions, which makes them much more powerful!
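As a quick side-by-side comparison, the sketch below applies the string methods and their `re` counterparts to a short made-up string (the sample `line` is an assumption, not taken from the notebook).

```python
import re

line = 'apple, banana;cherry'

# str.split handles only one fixed separator.
parts_str = line.split(', ')
# re.split can treat ',' or ';' followed by optional spaces as the separator.
parts_re = re.split('[,;] *', line)
print(parts_str)  # ['apple', 'banana;cherry']
print(parts_re)   # ['apple', 'banana', 'cherry']

# re.sub replaces every match of the pattern in one call.
replaced = re.sub('[,;] *', ' / ', line)
print(replaced)   # apple / banana / cherry

# re.findall collects the matches themselves rather than the text between them.
words = re.findall('[a-z]+', line)
print(words)      # ['apple', 'banana', 'cherry']
```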

### re.sub

In [18]:
s = '''The re module was added in Python 1.5, 
and provides Perl-style regular expression patterns. 
Earlier versions of Python came with the regex module, 
which provided Emacs-style patterns. 
The regex module was removed completely in Python 2.5.'''

In [19]:
s

'The re module was added in Python 1.5, \nand provides Perl-style regular expression patterns. \nEarlier versions of Python came with the regex module, \nwhich provided Emacs-style patterns. \nThe regex module was removed completely in Python 2.5.'

#### Problem
Suppose the goal is to split this text into a list in which each element is a word. The separators are the period (.), hyphen (-), comma (,), blank space ( ) and newline.

#### Solution
How can the problem be solved with the built-in string methods alone?

- Since `str.split` cannot split on several separators at once, first replace every separator with a blank space.
- Then split the resulting text on blank spaces.

This technique was used in the script wordcounter.py.

In [20]:
s2 = s
for i in [',', '.', '-', '\n']:
    s2 = s2.replace(i, ' ')
s2.split(' ')

['The',
 're',
 'module',
 'was',
 'added',
 'in',
 'Python',
 '1',
 '5',
 '',
 '',
 'and',
 'provides',
 'Perl',
 'style',
 'regular',
 'expression',
 'patterns',
 '',
 '',
 'Earlier',
 'versions',
 'of',
 'Python',
 'came',
 'with',
 'the',
 'regex',
 'module',
 '',
 '',
 'which',
 'provided',
 'Emacs',
 'style',
 'patterns',
 '',
 '',
 'The',
 'regex',
 'module',
 'was',
 'removed',
 'completely',
 'in',
 'Python',
 '2',
 '5',
 '']

With a regular expression, all the separators can be replaced in a single call.

In [None]:
s3 = s
s3 = re.sub('[\n,.-]', ' ', s3)   # replace every separator with a blank space
print (s3)
# The replacement leaves runs of consecutive blank spaces,
# so split on one or more blank spaces.
re.split(' +', s3)
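Since the cell above is shown without output, here is a minimal sketch of the same replace-then-split approach on a shorter, made-up sentence (the `sample` string and variable names are assumptions for illustration).

```python
import re

sample = 'Perl-style patterns, added in 1.5.'

spaced = re.sub('[\n,.-]', ' ', sample)   # every separator becomes a blank space
pieces = re.split(' +', spaced)           # split on runs of blank spaces
words = [p for p in pieces if p]          # drop empty strings

print(pieces)  # ['Perl', 'style', 'patterns', 'added', 'in', '1', '5', '']
print(words)   # ['Perl', 'style', 'patterns', 'added', 'in', '1', '5']
```

The trailing empty string appears because the sample ends with a separator; filtering it out afterwards is one simple way to clean the result.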

### re.split
A simpler way is to use a regular expression to split the text on all the separators directly.

In [21]:
re.split(r'[\n ,.-]+', s)  # the trailing '' appears because the text ends with a separator

['The',
 're',
 'module',
 'was',
 'added',
 'in',
 'Python',
 '1',
 '5',
 'and',
 'provides',
 'Perl',
 'style',
 'regular',
 'expression',
 'patterns',
 'Earlier',
 'versions',
 'of',
 'Python',
 'came',
 'with',
 'the',
 'regex',
 'module',
 'which',
 'provided',
 'Emacs',
 'style',
 'patterns',
 'The',
 'regex',
 'module',
 'was',
 'removed',
 'completely',
 'in',
 'Python',
 '2',
 '5',
 '']

### re.findall
Like **`re.split`**, **`re.findall`** also works well in this case. Instead of describing the separators, the pattern describes the words themselves.

Select the runs of letters in the string s by using **`re.findall`**.

In [22]:
re.findall('[a-zA-Z]+', s) # to include digits as well, run re.findall('[a-zA-Z0-9]+', s)

['The',
 're',
 'module',
 'was',
 'added',
 'in',
 'Python',
 'and',
 'provides',
 'Perl',
 'style',
 'regular',
 'expression',
 'patterns',
 'Earlier',
 'versions',
 'of',
 'Python',
 'came',
 'with',
 'the',
 'regex',
 'module',
 'which',
 'provided',
 'Emacs',
 'style',
 'patterns',
 'The',
 'regex',
 'module',
 'was',
 'removed',
 'completely',
 'in',
 'Python']

### Special sequences in regular expressions

Some sequences have a special meaning in regular expressions. The bracketed equivalents below are exact for ASCII text; for Python 3 `str` patterns, `\d`, `\w` and `\s` also match Unicode digits, letters and whitespace unless the `re.ASCII` flag is set.

* `\d`: Matches any decimal digit; this is equivalent to the class `[0-9]`.
* `\D`: Matches any non-digit character; this is equivalent to the class `[^0-9]`.
* `\w`: Matches any alphanumeric character or underscore; this is equivalent to the class `[a-zA-Z0-9_]`.
* `\W`: Matches any character that is not alphanumeric or an underscore; this is equivalent to the class `[^a-zA-Z0-9_]`.
* `\s`: Matches any whitespace character; this is equivalent to the class `[ \t\n\r\f\v]`.
* `\S`: Matches any non-whitespace character; this is equivalent to the class `[^ \t\n\r\f\v]`.
* `\t`: Tab.
* `\v`: Vertical tab.
* `\r`: Carriage return. Moves to the leading (left) end of the current line.
* `\n`: Line feed (newline). Historically moves to the next line while staying in the same column; on Unix-like systems it alone marks the end of a line.
* `\f`: Form feed. Feeds paper to a pre-established position on the form, usually the top of the page.

Reference: http://regexlib.com/CheatSheet.aspx?AspxAutoDetectCookieSupport=1
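The shorthand classes listed above can be tried on a short string. This is a minimal sketch; the `sample` text is made up for illustration.

```python
import re

sample = 'Room 42\tis on floor 3_b\n'

digits = re.findall(r'\d+', sample)       # runs of digits
words = re.findall(r'\w+', sample)        # runs of letters, digits and underscores
spaces = re.findall(r'\s', sample)        # individual whitespace characters
non_digits = re.findall(r'\D+', 'a1b22c') # runs of non-digit characters

print(digits)      # ['42', '3']
print(words)       # ['Room', '42', 'is', 'on', 'floor', '3_b']
print(spaces)      # [' ', '\t', ' ', ' ', ' ', '\n']
print(non_digits)  # ['a', 'b', 'c']
```

Note that `\w` treats the underscore as a word character, so `3_b` is returned as a single item.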

With `\w`, the simplest way to solve the word-splitting problem is as follows:

In [23]:
re.findall(r'\w+', s) # same as re.findall('[a-zA-Z0-9_]+', s) for ASCII text

['The',
 're',
 'module',
 'was',
 'added',
 'in',
 'Python',
 '1',
 '5',
 'and',
 'provides',
 'Perl',
 'style',
 'regular',
 'expression',
 'patterns',
 'Earlier',
 'versions',
 'of',
 'Python',
 'came',
 'with',
 'the',
 'regex',
 'module',
 'which',
 'provided',
 'Emacs',
 'style',
 'patterns',
 'The',
 'regex',
 'module',
 'was',
 'removed',
 'completely',
 'in',
 'Python',
 '2',
 '5']

In [28]:
re.findall(r'\b[a-z]+', s) # Skips capitalized words (The, Python, Perl, ...): [a-z] cannot match the capital, and there is no word boundary after it

['re',
 'module',
 'was',
 'added',
 'in',
 'and',
 'provides',
 'style',
 'regular',
 'expression',
 'patterns',
 'versions',
 'of',
 'came',
 'with',
 'the',
 'regex',
 'module',
 'which',
 'provided',
 'style',
 'patterns',
 'regex',
 'module',
 'was',
 'removed',
 'completely',
 'in']
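A word boundary `\b` also matches at the very start of a string, so whether a word is found depends on the character class rather than on a preceding space. The sketch below illustrates this on a made-up string (`sample` is an assumption) and shows two ways of accepting capitalized words.

```python
import re

sample = 'the Re module'

# 'the' is found even though nothing precedes it; 'Re' is skipped
# because [a-z] cannot match 'R' and there is no boundary before 'e'.
lower_only = re.findall(r'\b[a-z]+', sample)

# Either widen the class or ignore case to keep capitalized words.
any_case = re.findall(r'\b[A-Za-z]+', sample)
ignore_case = re.findall(r'\b[a-z]+', sample, flags=re.IGNORECASE)

print(lower_only)   # ['the', 'module']
print(any_case)     # ['the', 'Re', 'module']
print(ignore_case)  # ['the', 'Re', 'module']
```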

In [27]:
re.findall('[a-z]+', s) # without \b, capitalized words lose their first letter

['he',
 're',
 'module',
 'was',
 'added',
 'in',
 'ython',
 'and',
 'provides',
 'erl',
 'style',
 'regular',
 'expression',
 'patterns',
 'arlier',
 'versions',
 'of',
 'ython',
 'came',
 'with',
 'the',
 'regex',
 'module',
 'which',
 'provided',
 'macs',
 'style',
 'patterns',
 'he',
 'regex',
 'module',
 'was',
 'removed',
 'completely',
 'in',
 'ython']

### wordCount

Rewrite the function `wordCount` using regular expressions.

In [32]:
import re

def wordCount(x, number=False):
    '''
    Count how often each word occurs in a string, ignoring case.

    x: string to count
    number: whether to also count numbers (digit sequences)
    '''
    # convert to lowercase and find the words
    x = x.lower()
    if number:
        word_list = re.findall(r'\w+', x)     # letters, digits and underscores
    else:
        word_list = re.findall('[a-z]+', x)   # letters only (x is already lowercase)
    # count and return
    result = {}
    for word in word_list:
        if word in result:
            result[word] += 1
        else:
            result[word] = 1
    return result

In [33]:
wordCount(s)

{'added': 1,
 'and': 1,
 'came': 1,
 'completely': 1,
 'earlier': 1,
 'emacs': 1,
 'expression': 1,
 'in': 2,
 'module': 3,
 'of': 1,
 'patterns': 2,
 'perl': 1,
 'provided': 1,
 'provides': 1,
 'python': 3,
 're': 1,
 'regex': 2,
 'regular': 1,
 'removed': 1,
 'style': 2,
 'the': 3,
 'versions': 1,
 'was': 2,
 'which': 1,
 'with': 1}
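As an alternative sketch, the counting loop can be delegated to `collections.Counter` from the standard library. The function name `word_count_counter` and the sample sentence are hypothetical, chosen here for illustration; the word patterns are the same as in `wordCount`.

```python
import re
from collections import Counter

def word_count_counter(text, number=False):
    '''Count words in text, ignoring case; include digit sequences if number is True.'''
    pattern = r'\w+' if number else r'[a-z]+'
    return Counter(re.findall(pattern, text.lower()))

counts = word_count_counter('The regex module. The re module, added in Python 1.5.')
print(counts['module'])        # 2
print(counts.most_common(2))   # [('the', 2), ('module', 2)]
```

A `Counter` behaves like a dictionary, so it can be used wherever the result of `wordCount` is expected, and `most_common` gives the most frequent words directly.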