
# Exploratory Text Analysis

## What kinds of text analysis are there?

* analyst knows the pattern
    * regular expressions
* analyst does not know the pattern
    * natural language processing
        * compares historical examples to judge novel cases
            * comparisons are statistical and approximate
            

### Examples of Analysis

If you know the pattern:

In [None]:
pattern = '£ ?[0-9][0-9]?' # £ then SPACE-optional then digit then digit-optional 

document = 'Eggs cost £3, bread cost £2, organic apple juice cost £5'

In [None]:
import re

In [None]:
re.findall(pattern, document)

['£3', '£2', '£5']

If you don't know the pattern:

* sentiment analysis
    * how positive/negative is this (new) review?
* topic analysis 
    * what is this document about?

## What can I do if I know what pattern I want to find?

* finding ("extracting")
    * what matches the pattern?
* matching ("validating")
    * does the entire document match YES/NO?
* substituting ("replacing")
    * replace a part that matches a pattern with another...
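
The three operations above can be sketched with Python's built-in `re` module. This is a minimal illustration that reuses the price pattern from earlier; the test strings `'£12'` and `'£12 each'` are made up for the example.

```python
import re

document = 'Eggs cost £3, bread cost £2, organic apple juice cost £5'
price = r'£ ?[0-9][0-9]?'  # £, optional space, one or two digits

# Finding ("extracting"): every substring that matches the pattern.
found = re.findall(price, document)

# Matching ("validating"): does the ENTIRE string match? (None means no.)
is_price = re.fullmatch(price, '£12') is not None           # made-up input
is_not_price = re.fullmatch(price, '£12 each') is not None  # made-up input

# Substituting ("replacing"): swap every match for another string.
redacted = re.sub(price, '£?', document)

print(found)
print(is_price, is_not_price)
print(redacted)
```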

## How do I validate text with pandas?

In [None]:
import pandas as pd

In [None]:
ti = pd.read_csv('datasets/titanic.csv')
ti.sample(1)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
607,1,1,male,27.0,0,0,30.5,S,First,man,True,,Southampton,yes,True


In [None]:
ti['ticket'] = "Ticket: " + ti['class'] + "; Price: $ " + ti['fare'].astype(str) + "; Port: " + ti['embark_town'] + ";"

In [None]:
ti[['class', 'fare', 'embark_town', 'ticket']].head(3)

Unnamed: 0,class,fare,embark_town,ticket
0,Third,7.25,Southampton,Ticket: Third; Price: $ 7.25; Port: Southampton;
1,First,71.2833,Cherbourg,Ticket: First; Price: $ 71.2833; Port: Cherbourg;
2,Third,7.925,Southampton,Ticket: Third; Price: $ 7.925; Port: Southampton;


In [None]:
pattern = '(First|Second)'

ti['class'].str.match(pattern)

0      False
1       True
2      False
3       True
4      False
       ...  
886     True
887     True
888    False
889     True
890    False
Name: class, Length: 891, dtype: bool

In [None]:
ti.loc[ ti['class'].str.match(pattern)  , 'survived'].mean()

0.5575
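
One caveat worth sketching: `str.match` only checks that the pattern matches at the *start* of each value, while `str.fullmatch` (available in recent pandas versions) requires the whole value to match. The small Series below is made up for illustration and is not part of the Titanic data.

```python
import pandas as pd

# Made-up values; 'First Class' starts with a match but is not a full match.
classes = pd.Series(['First', 'Second', 'Third', 'First Class'])
pattern = '(First|Second)'

starts_with = classes.str.match(pattern)      # anchored at the start only
whole_value = classes.str.fullmatch(pattern)  # entire value must match

print(starts_with.tolist())
print(whole_value.tolist())
```

For a column like `class`, where every value is exactly one word, both methods give the same answer.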

## How do I extract data with pandas?

In [None]:
ti[['class', 'fare', 'embark_town', 'ticket']].head(3)

Unnamed: 0,class,fare,embark_town,ticket
0,Third,7.25,Southampton,Ticket: Third; Price: $ 7.25; Port: Southampton;
1,First,71.2833,Cherbourg,Ticket: First; Price: $ 71.2833; Port: Cherbourg;
2,Third,7.925,Southampton,Ticket: Third; Price: $ 7.925; Port: Southampton;


In [None]:
pattern = '([0-9.]+)'

ti['ticket'].str.extract(pattern).sample(4)

Unnamed: 0,0
669,52.0
747,13.0
118,247.5208
352,7.2292
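
Extracted values come back as text. A possible follow-up, sketched here on two tickets copied from the table above, is to name the capture group (which names the resulting column) and convert it to a number. The column name `price` is an assumption chosen for this example.

```python
import pandas as pd

tickets = pd.Series([
    'Ticket: Third; Price: $ 7.25; Port: Southampton;',
    'Ticket: First; Price: $ 71.2833; Port: Cherbourg;',
])

# (?P<price>...) is a named group: the output column is called 'price'.
# \. matches a literal dot, so the pattern is digits, a dot, then digits.
prices = tickets.str.extract(r'(?P<price>[0-9]+\.[0-9]+)')

# The extracted values are strings; convert them before doing arithmetic.
prices['price'] = prices['price'].astype(float)

print(prices)
```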


## How do I substitute text with pandas?

In [None]:
ti['ticket'].str.replace('$', '€', regex=False).sample(1)  # regex=False: treat '$' as a literal character

262    Ticket: First; Price: € 79.65; Port: Southampton;
Name: ticket, dtype: object
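
In a regular expression `$` means "end of the text", so a literal dollar sign has to be either replaced with `regex=False` or escaped as `\$`. The sketch below shows both on a single ticket copied from the output above; the second call also uses a capture group and the backreference `\1` to keep the amount while dropping the space.

```python
import pandas as pd

tickets = pd.Series(['Ticket: First; Price: $ 79.65; Port: Southampton;'])

# Literal replacement: '$' is an ordinary character here.
literal = tickets.str.replace('$', '€', regex=False)

# Regex replacement: \$ is a literal dollar, \1 re-inserts the captured amount.
with_regex = tickets.str.replace(r'\$ ([0-9.]+)', r'€\1', regex=True)

print(literal[0])
print(with_regex[0])
```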

## What are regular expressions?

In [None]:
ti['ticket'].str.extract('(Ticket: (First|Second))')

Unnamed: 0,0,1
0,,
1,Ticket: First,First
2,,
3,Ticket: First,First
4,,
...,...,...
886,Ticket: Second,Second
887,Ticket: First,First
888,,
889,Ticket: First,First


In [None]:
ti['ticket'].str.extract('( [0-9][0-9])')

Unnamed: 0,0
0,
1,71
2,
3,53
4,
...,...
886,13
887,30
888,23
889,30


In [None]:
ti['ticket'].sample(1)

460    Ticket: First; Price: $ 26.55; Port: Southampton;
Name: ticket, dtype: object

In [None]:
ti['ticket'].str.extract('(Ticket: [A-Z])').sample(2)

Unnamed: 0,0
358,Ticket: T
453,Ticket: F


In [None]:
ti['ticket'].str.extract('(T........)').sample(3)

Unnamed: 0,0
636,Ticket: T
774,Ticket: S
40,Ticket: T


In [None]:
ti['ticket'].str.extract('(Price: [^0-9A-Za-z] ..)').sample(3)

Unnamed: 0,0
134,Price: $ 13
825,Price: $ 6.
246,Price: $ 7.


In [None]:
ti['ticket'].str.extract('(Port: (Cherbourg|Southampton))').sample(3)

Unnamed: 0,0,1
125,Port: Cherbourg,Cherbourg
91,Port: Southampton,Southampton
426,Port: Southampton,Southampton


* repetitions
    * optional `?`
        * an optional number: `[0-9]?`
    * one or more `+`
        * one or more spaces: ` +`  
    * zero or more `*`
        * one or two digits, any single character, then zero or more digits: `[0-9][0-9]?.[0-9]*`
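
The three repetition symbols can be tried out with plain `re`. This is a small sketch on made-up strings; it also shows that an unescaped `.` matches *any* character, so the pattern above accepts a number with no decimal point at all.

```python
import re

# `?` : the preceding item is optional ('w' may or may not be there).
optional = re.findall(r'[a-zA-Z]+tow?n', 'Southampton Queenstown Cherbourg')

# `+` : one or more of the preceding item.
one_or_more = re.findall(r'[0-9]+', 'a1 b22 c333')

# `*` : zero or more of the preceding item.
zero_or_more = re.findall(r'[a-z][0-9]*', 'a b22')

# Unescaped `.` matches any character; `\.` matches only a literal dot.
dot_any = re.findall(r'[0-9][0-9]?.[0-9]*', 'cost 1350')
dot_literal = re.findall(r'[0-9][0-9]?\.[0-9]*', 'cost 1350')

print(optional, one_or_more, zero_or_more, dot_any, dot_literal)
```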
    

In [None]:
ti['ticket'].str.extract('([0-9][0-9]?.[0-9]*)').sample(3)

Unnamed: 0,0
149,13.0
166,55.0
527,221.0


In [None]:
ti['ticket'].str.extract('(Ticket: [a-zA-Z]+)').sample(3)

Unnamed: 0,0
450,Ticket: Second
783,Ticket: Third
9,Ticket: Second


In [None]:
row = 0
match = 1 # second match

ti['ticket'].str.extractall('([a-zA-Z]+: [a-zA-Z]+)').loc[row, match]

0    Port: Southampton
Name: (0, 1), dtype: object
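
`extractall` returns one row per match, indexed by the pair (original row label, match number). A minimal sketch of navigating that index, using two tickets copied from the table earlier:

```python
import pandas as pd

tickets = pd.Series([
    'Ticket: Third; Price: $ 7.25; Port: Southampton;',
    'Ticket: First; Price: $ 71.2833; Port: Cherbourg;',
])

matches = tickets.str.extractall('([a-zA-Z]+: [a-zA-Z]+)')

# The second match (match number 1) for every ticket, from column 0.
second = matches.xs(1, level='match')[0]

# A single cell: row 0, match 1, column 0.
one_cell = matches.loc[(0, 1), 0]

print(matches)
print(second.tolist())
print(one_cell)
```

`Price: $` is not matched because `$` is not a letter, which is why the second match is the port.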

In [None]:
ti['ticket'].str.extract('([a-zA-Z]+tow?n)')

Unnamed: 0,0
0,Southampton
1,
2,Southampton
3,Southampton
4,Southampton
...,...
886,Southampton
887,Southampton
888,Southampton
889,


* EXTRA: 
    * escaping
        * How do I say, literally, the `.` symbol?
        * `\.`
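
A short sketch of escaping with plain `re`, on a ticket copied from the output above. Writing the pattern as a raw string (`r'...'`) stops Python from interpreting the backslashes itself, and `re.escape` can build the escaped form of a literal piece of text for you.

```python
import re

ticket = 'Ticket: First; Price: $ 26.55; Port: Southampton;'

# Hand-escaped: \$ is a literal dollar sign, \. is a literal dot.
by_hand = re.search(r'\$ [0-9]+\.[0-9]+', ticket).group()

# re.escape adds the backslashes needed to match a string literally.
auto_pattern = re.escape('$ 26.55')
by_escape = re.search(auto_pattern, ticket).group()

print(by_hand)
print(by_escape)
```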
    

In [None]:
ti['ticket'].str.extract(r'(\$ [0-9]+\.[0-9]+)').sample(2)  # raw string keeps the backslashes intact

Unnamed: 0,0
339,$ 35.5
732,$ 0.0


* positional matching
    * `^` means **at the beginning**
    * `$` means **at the end**
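
The effect of the two anchors can be seen by running the same pattern three ways. This is a sketch with plain `re` on one ticket copied from the output above; the variable name `field` is chosen just for the example.

```python
import re

ticket = 'Ticket: First; Price: $ 26.55; Port: Southampton;'
field = r'[a-zA-Z]+: [a-zA-Z]+;'  # word, colon, space, word, semicolon

anywhere = re.findall(field, ticket)        # no anchor: all matches
at_start = re.findall('^' + field, ticket)  # must begin the text
at_end = re.findall(field + '$', ticket)    # must finish the text

print(anywhere)
print(at_start)
print(at_end)
```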

In [None]:
ti['ticket'].str.extractall('([a-zA-Z]+: [a-zA-Z]+;$)').sample(1)

Unnamed: 0,match,0
127,0,Port: Southampton;


In [None]:
ti['ticket'].str.extractall('(^[a-zA-Z]+: [a-zA-Z]+;)').sample(1)

Unnamed: 0,match,0
645,0,Ticket: First;


In [None]:
ti['ticket'].str.findall('([a-zA-Z]+[ :])').sample(10)

83     [Ticket:, Price:, Port:]
741    [Ticket:, Price:, Port:]
396    [Ticket:, Price:, Port:]
687    [Ticket:, Price:, Port:]
808    [Ticket:, Price:, Port:]
409    [Ticket:, Price:, Port:]
174    [Ticket:, Price:, Port:]
350    [Ticket:, Price:, Port:]
118    [Ticket:, Price:, Port:]
778    [Ticket:, Price:, Port:]
Name: ticket, dtype: object

In [None]:
ti['ticket'].str.extract(r'(\$ [0-9]+\.[0-9]+)').sample(2)

Unnamed: 0,0
4,$ 8.05
719,$ 7.775


In [None]:
row = 0
match = 1 # second match
ti['ticket'].str.extractall(r'(\$ [0-9][0-9][0-9]+\.[0-9]+)')#.loc[row, 0]  # prices with at least three digits before the dot