# Introduction to Text Mining
## What can be done with text
- Parse text;
- Find/Identify/Extract relevant information from text;
- Classify text documents;
- Search for relevant text documents;
- Sentiment analysis.

# Handling Text in Python
## Primitive constructs in Text
- Sentences/input strings;
- Words or Tokens;
- Characters;
- Document, larger files.

## Finding specific words
- Long words, for example words longer than 3 letters:
  ```python
  [w for w in text.split(' ') if len(w) > 3]
  ```
- Capitalized words:
  ```python
  [w for w in text.split(' ') if w.istitle()]
  ```
- Words ending in 's':
  ```python
  [w for w in text.split(' ') if w.endswith('s')]
  ```
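
As a quick check, the three filters can be run on a short string. The sentence below is an assumed sample, not part of the original notes.

```python
# Sample sentence (an assumption, used only for illustration)
text = "Ethics are built right into the ideals and objectives of the United Nations"

words = text.split(' ')

long_words = [w for w in words if len(w) > 3]         # more than 3 letters
capitalized = [w for w in words if w.istitle()]       # first letter upper, rest lower
ends_with_s = [w for w in words if w.endswith('s')]   # last character is 's'

print(long_words)
print(capitalized)    # ['Ethics', 'United', 'Nations']
print(ends_with_s)    # ['Ethics', 'ideals', 'objectives', 'Nations']
```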

## Finding unique words: using set()
- set(text.split(' ')): returns the unique words as a set, but is case-sensitive;
- To ignore case differences:
  ```python
  set([w.lower() for w in text.split(' ')])
  ```
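
A small sketch of the difference, using an assumed sample phrase: without lowercasing, 'To' and 'to' count as two distinct words.

```python
# Sample phrase (an assumption, used only for illustration)
text = "To be or not to be"
words = text.split(' ')

unique_case_sensitive = set(words)                         # keeps 'To' and 'to' apart
unique_case_insensitive = set(w.lower() for w in words)    # merges them

print(len(unique_case_sensitive))     # 5
print(len(unique_case_insensitive))   # 4
```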

## Some word comparison functions
- s.startswith(t);
- s.endswith(t);
- t in s;
- s.isupper(), s.islower(), s.istitle();
- s.isalpha(), s.isdigit(), s.isalnum();
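
The comparison functions can be tried on a handful of tokens. The token list below is an assumed example chosen so that each test selects something different.

```python
# Hypothetical tokens, one per case of interest
tokens = ["Python", "NLTK", "regex", "2002", "utf8", "#tag"]

title_words = [s for s in tokens if s.istitle()]      # ['Python']
upper_words = [s for s in tokens if s.isupper()]      # ['NLTK']
alpha_only = [s for s in tokens if s.isalpha()]       # letters only
digits_only = [s for s in tokens if s.isdigit()]      # ['2002']
alnum_only = [s for s in tokens if s.isalnum()]       # letters and/or digits
hashtags = [s for s in tokens if s.startswith("#")]   # ['#tag']
with_e = [s for s in tokens if "e" in s]              # substring test

print(alpha_only)   # ['Python', 'NLTK', 'regex']
print(alnum_only)   # ['Python', 'NLTK', 'regex', '2002', 'utf8']
print(with_e)       # ['regex']
```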

## String Operations
- s.lower(), s.upper(), s.title();
- s.split(t);
- s.splitlines();
- s.join(t);
- s.strip(), s.rstrip();
- s.find(t), s.rfind(t);
- s.replace(u, v).
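
A short sketch that chains several of these operations on an assumed sample string:

```python
# Sample string with surrounding whitespace (an assumption for illustration)
s = "  Text Mining in Python \n"

clean = s.strip()                  # remove leading and trailing whitespace
words = clean.split(' ')           # split on single spaces
joined = '-'.join(words)           # join the words with a separator
first_in = clean.find('in')        # index of the first 'in' (inside 'Mining')
last_in = clean.rfind('in')        # index of the last 'in' (the word 'in')
swapped = clean.replace('Python', 'pandas')
lines = "line one\nline two".splitlines()

print(clean)                # Text Mining in Python
print(joined)               # Text-Mining-in-Python
print(first_in, last_in)    # 6 12
print(swapped)              # Text Mining in pandas
print(lines)                # ['line one', 'line two']
```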

## Handling larger texts
- Reading files line by line
  ```python
  f = open('filename','r')
  f.readline()
  ```
- Reading the full file
  ```python
  f.seek(0) # Return to the beginning of the file after the readline() call
  text = f.read()
  ```
- Writing in the file
  ```python
  f = open('filename','w')
  f.write(message)
  ```
- After handling the file, you need to run `f.close()`
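
One common alternative to calling `f.close()` by hand is a `with` block, which closes the file automatically. The sketch below writes and reads a throwaway file in a temporary directory; the file name and contents are assumptions for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "notes.txt")    # hypothetical file name

    # Write two lines; the file is closed when the block ends
    with open(path, "w") as f:
        f.write("first line\nsecond line\n")

    with open(path, "r") as f:
        first = f.readline()   # reads up to and including the first newline
        f.seek(0)              # go back to the start of the file
        text = f.read()        # read the whole file as one string

print(repr(first))             # 'first line\n'
print(text.splitlines())       # ['first line', 'second line']
print(f.closed)                # True
```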

# Regular Expressions
## Processing free-text

In [None]:
tweet = "@nltk Text analysis is awesome! #regex #pandas #python"
print([word for word in tweet.split() if word.startswith('#')])

['#regex', '#pandas', '#python']


## Finding specific words
- Hashtag:
  ```python
  [w for w in text.split() if w.startswith('#')]
  ```
- Callouts:
  ```python
  import re
  [w for w in text.split() if re.search(r'@[A-Za-z0-9_]+', w)]
  ```
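
A sketch of the callout filter on the tweet from the earlier cell. Note that `re.search` matches anywhere inside a token, so a token such as `user@example` (an assumed example) also passes; `re.match` anchors the pattern at the start of the token.

```python
import re

tweet = "@nltk Text analysis is awesome! #regex #pandas #python"
tokens = tweet.split()

# Tokens that contain a callout pattern
callouts = [w for w in tokens if re.search(r'@[A-Za-z0-9_]+', w)]

# search() finds '@example' in the middle of the token; match() does not
loose = bool(re.search(r'@[A-Za-z0-9_]+', 'user@example'))
strict = bool(re.match(r'@[A-Za-z0-9_]+', 'user@example'))

# findall() pulls callouts and hashtags straight from the string
found = re.findall(r'[@#]\w+', tweet)

print(callouts)        # ['@nltk']
print(loose, strict)   # True False
print(found)           # ['@nltk', '#regex', '#pandas', '#python']
```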

## Meta-characters
### Character matches
- . : wildcard, matches a single character;
- ^ : start of a string;
- $ : end of string;
- [] : matches one of the set of characters within [];
- [^abc] : matches a character that is not a, b or c;
- a|b : matches either a or b, where a and b are strings;
- () : Scoping for operators;
- \ : Escape characters for special characters (\t, \n, \b);
- \b : Matches word boundary;
- \d : Any digit, equivalent to [0-9];
- \D : Any non-digit, equivalent to [^0-9];
- \s : Any whitespace;
- \S : Any non-whitespace;
- \w : Alphanumeric character;
- \W : Non-alphanumeric.

### Repetitions
- \* : Matches zero or more occurrences;
- \+ : Matches one or more occurrences;
- ? : Matches zero or one occurrences;
- {n} : Exactly n repetitions;
- {n,} : At least n repetitions;
- {,n} : At most n repetitions;
- {m,n} : At least m and at most n repetitions
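
A few of the meta-characters and repetition operators in action, on an assumed sample sentence:

```python
import re

# Sample sentence (an assumption, used only for illustration)
text = "Order 66 shipped on 2002-10-23 to room B12."

digit_runs = re.findall(r'\d+', text)               # one or more digits
four_digit = re.findall(r'\b\d{4}\b', text)         # exactly four digits between word boundaries
iso_dates = re.findall(r'\d{4}-\d{2}-\d{2}', text)  # fixed repetition counts
cap_words = re.findall(r'[A-Z]\w*', text)           # a capital letter, then zero or more word characters
either = re.findall(r'room|Order', text)            # alternation
optional = re.findall(r'colou?r', "color colour")   # 'u' is optional
starts = bool(re.search(r'^Order', text))           # anchored at the start
ends = bool(re.search(r'\.$', text))                # escaped '.', anchored at the end

print(digit_runs)   # ['66', '2002', '10', '23', '12']
print(four_digit)   # ['2002']
print(iso_dates)    # ['2002-10-23']
print(cap_words)    # ['Order', 'B12']
```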

## Expression for Dates
- Example: date variations for 23rd October 2002
  - 23-10-2002;
  - 23/10/2002;
  - 23/10/02;
  - 10/23/2002;
  - 23 Oct 2002;
  - 23 October 2002;
  - Oct 23, 2002;
  - October 23, 2002.

- How to match the variants that spell out the month (the purely numeric forms need a separate pattern):
```python
re.findall(r'(?:\d{1,2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* (?:\d{1,2}, )?\d{4}', text)
```
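
The pattern above can be checked against the listed variants. It covers the forms that contain a month name; the `numeric_pattern` below is an assumed companion pattern for the purely numeric forms, a sketch rather than a complete date parser.

```python
import re

variants = ["23-10-2002", "23/10/2002", "23/10/02", "10/23/2002",
            "23 Oct 2002", "23 October 2002", "Oct 23, 2002", "October 23, 2002"]
text = "; ".join(variants)

# Pattern from the notes: optional day, month name, optional "day, ", four-digit year
month_pattern = (r'(?:\d{1,2} )?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)'
                 r'[a-z]* (?:\d{1,2}, )?\d{4}')
# Assumed pattern for numeric dates separated by '/' or '-'
numeric_pattern = r'\d{1,2}[/-]\d{1,2}[/-]\d{2,4}'

month_dates = re.findall(month_pattern, text)
numeric_dates = re.findall(numeric_pattern, text)

print(month_dates)     # the four variants with a month name
print(numeric_dates)   # the four numeric variants
```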

# Working with Text Data in Pandas

In [None]:
import pandas as pd

time_sentences = ["Monday: The doctor's appointment is at 2:45pm.",
                  "Tuesday: The dentist's appointment is at 11:30 am.",
                  "Wednesday: At 7:00pm, there is a basketball game!",
                  "Thursday: Be back home by 11:15 pm at the latest.",
                  "Friday: Take the train at 08:10 am, arrive at 09:00am."]

df = pd.DataFrame(time_sentences, columns=['text'])
df

,text
0,Monday: The doctor's appointment is at 2:45pm.
1,Tuesday: The dentist's appointment is at 11:30...
2,"Wednesday: At 7:00pm, there is a basketball game!"
3,Thursday: Be back home by 11:15 pm at the latest.
4,"Friday: Take the train at 08:10 am, arrive at ..."


In [None]:
# Number of characters for each string
df['text'].str.len()

,text
0,46
1,50
2,49
3,49
4,54


In [None]:
# Find the number of words for each string
df['text'].str.split(' ').str.len()

,text
0,7
1,8
2,8
3,10
4,10


In [None]:
# Check if the string contains a pattern
df['text'].str.contains('appointment')

,text
0,True
1,True
2,False
3,False
4,False


In [None]:
# find how many times a digit occurs in each string
df['text'].str.count(r'\d')

,text
0,3
1,4
2,3
3,4
4,8


In [None]:
# Find what digits occur
df['text'].str.findall(r'\d+')

,text
0,"[2, 45]"
1,"[11, 30]"
2,"[7, 00]"
3,"[11, 15]"
4,"[08, 10, 09, 00]"


In [None]:
# Grouping hour and minutes
df['text'].str.findall(r'(\d?\d):(\d\d)')

,text
0,"[(2, 45)]"
1,"[(11, 30)]"
2,"[(7, 00)]"
3,"[(11, 15)]"
4,"[(08, 10), (09, 00)]"


In [None]:
df['text'].str.replace(r'\w+day', '???', regex=True)

,text
0,???: The doctor's appointment is at 2:45pm.
1,???: The dentist's appointment is at 11:30 am.
2,"???: At 7:00pm, there is a basketball game!"
3,???: Be back home by 11:15 pm at the latest.
4,"???: Take the train at 08:10 am, arrive at 09:..."


In [None]:
# Replace the weekday by its 3-letter abbreviation
df['text'].str.replace(r'(\w+day\b)',lambda x: x.groups()[0][:3], regex=True)

,text
0,Mon: The doctor's appointment is at 2:45pm.
1,Tue: The dentist's appointment is at 11:30 am.
2,"Wed: At 7:00pm, there is a basketball game!"
3,Thu: Be back home by 11:15 pm at the latest.
4,"Fri: Take the train at 08:10 am, arrive at 09:..."


In [None]:
# Create new columns from first match of extracted groups
df['text'].str.extract(r'(\d?\d):(\d\d)')

,0,1
0,2,45
1,11,30
2,7,00
3,11,15
4,08,10


In [None]:
df['text'].str.extractall(r'(\d?\d):(\d\d) ?([ap]m)')

,match,0,1,2
0,0,2,45,pm
1,0,11,30,am
2,0,7,00,pm
3,0,11,15,pm
4,0,08,10,am
4,1,09,00,am


In [None]:
# Naming groups
df['text'].str.extractall(r'(?P<time>(?P<hour>\d?\d):(?P<minute>\d\d) ?(?P<period>[ap]m))')

,match,time,hour,minute,period
0,0,2:45pm,2,45,pm
1,0,11:30 am,11,30,am
2,0,7:00pm,7,00,pm
3,0,11:15 pm,11,15,pm
4,0,08:10 am,08,10,am
4,1,09:00am,09,00,am


# Internationalization and Issues with Non-ASCII Characters

## English and ASCII
- ASCII: American Standard Code for Information Interchange:
  - 7-bit character encoding standard: 128 valid codes;
  - Range: 0x00 - 0x7F;
  - Includes letters, digits, punctuation, common symbols and control characters;
  - Worked well for English typewriting.
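
A brief sketch of what the 128-code limit means in Python: plain English text encodes to ASCII, while a string with an accented letter (an assumed example) does not.

```python
# Code point of 'A' in decimal and hexadecimal; it lies inside 0x00 - 0x7F
code_a = ord('A')
print(code_a, hex(code_a))    # 65 0x41

# Plain English text fits in ASCII
ascii_bytes = "text".encode("ascii")
is_ascii = all(ord(c) < 128 for c in "text")

# 'café' contains U+00E9, which is outside the ASCII range
try:
    "caf\u00e9".encode("ascii")
    encodable = True
except UnicodeEncodeError:
    encodable = False

print(ascii_bytes)    # b'text'
print(is_ascii)       # True
print(encodable)      # False
```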

## Written scripts encoding
- Latin 36%;
- Chinese 18%;
- Devanagari 14%;
- Arabic 14%;
...

## Other Character Encodings
- IBM EBCDIC;
- Latin-1;
- UTF-8;
- JIS: Japanese Industrial Standards;
...

## Unicode
- Industry standard for encoding and representing text;
- Over 128k characters from 130+ scripts and symbol sets;
- Can be implemented by different character encodings:
  - UTF-8
  - UTF-16
  - UTF-32

## UTF-8
- Unicode Transformation Format - 8 bits;
- Variable length encoding: One to four bytes;
- Backward compatible with ASCII;
- Dominant character encoding for the Web;
- Default in Python 3.
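
The variable-length property can be seen by encoding a few characters and counting the bytes. The sample characters below are assumptions chosen to need one, two, three and four bytes.

```python
# Sample characters, written as escapes so the source file itself stays ASCII
samples = {
    "a": "a",                # U+0061, Latin letter
    "e-acute": "\u00e9",     # U+00E9
    "euro": "\u20ac",        # U+20AC
    "emoji": "\U0001F600",   # U+1F600
}

# Number of bytes each character needs in UTF-8
utf8_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
print(utf8_lengths)    # {'a': 1, 'e-acute': 2, 'euro': 3, 'emoji': 4}

# Backward compatibility: ASCII text gives the same bytes in both encodings
ascii_same = "text".encode("utf-8") == "text".encode("ascii")

# Encoding and then decoding returns the original string
roundtrip = "\u20ac".encode("utf-8").decode("utf-8") == "\u20ac"

print(ascii_same, roundtrip)    # True True
```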


# Assignment 1

In this assignment, you'll be working with messy medical data and using regex to extract relevant information from the data.

Each line of the `dates.txt` file corresponds to a medical note. Each note has a date that needs to be extracted, but each date is encoded in one of many formats.

The goal of this assignment is to correctly identify all of the different date variants encoded in this dataset and to properly normalize and sort the dates.

Here is a list of some of the variants you might encounter in this dataset:
* 04/20/2009; 04/20/09; 4/20/09; 4/3/09
* Mar-20-2009; Mar 20, 2009; March 20, 2009;  Mar. 20, 2009; Mar 20 2009;
* 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
* Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
* Feb 2009; Sep 2009; Oct 2010
* 6/2008; 12/2009
* 2009; 2010

Once you have extracted these date patterns from the text, the next step is to sort them in ascending chronological order according to the following rules:
* Assume all dates in xx/xx/xx format are mm/dd/yy
* Assume all dates where year is encoded in only two digits are years from the 1900's (e.g. 1/5/89 is January 5th, 1989)
* If the day is missing (e.g. 9/2009), assume it is the first day of the month (e.g. September 1, 2009).
* If the month is missing (e.g. 2010), assume it is the first of January of that year (e.g. January 1, 2010).
* Watch out for potential typos as this is a raw, real-life derived dataset.
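
The rules for numeric dates can be sketched with the standard library alone. `normalize_numeric` is a hypothetical helper, not part of the assignment template, and it handles only the `mm/dd/yy`, `mm/dd/yyyy`, `mm/yyyy` and `yyyy` forms.

```python
import re
from datetime import date


def normalize_numeric(raw):
    """Hypothetical helper: turn a numeric date string into a date object."""
    parts = [int(p) for p in re.split(r'[/-]', raw.strip())]
    if len(parts) == 3:        # mm/dd/yy or mm/dd/yyyy
        month, day, year = parts
    elif len(parts) == 2:      # mm/yyyy: day missing, use the 1st
        month, year = parts
        day = 1
    else:                      # yyyy: month and day missing, use January 1st
        year = parts[0]
        month, day = 1, 1
    if year < 100:             # two-digit years are taken from the 1900s
        year += 1900
    return date(year, month, day)


print(normalize_numeric("1/5/89"))    # 1989-01-05
print(normalize_numeric("9/2009"))    # 2009-09-01
print(normalize_numeric("2010"))      # 2010-01-01
```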

With these rules in mind, find the correct date in each note and return a pandas Series in chronological order of the original Series' indices. **This Series should be sorted by a tie-break sort in the format of ("extracted date", "original row number").**

For example if the original series was this:

    0    1999
    1    2010
    2    1978
    3    2015
    4    1985

Your function should return this:

    0    2
    1    4
    2    0
    3    1
    4    3
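
The expected ordering can be reproduced without pandas by sorting row numbers on the key ("extracted date", "original row number"). This is a minimal sketch of the tie-break idea, not the graded solution; the second list is an assumed example with a tie.

```python
# The example series from above, as a plain list indexed by row number
years = [1999, 2010, 1978, 2015, 1985]

# Sort row numbers by (value, row number); the row number breaks ties
order = sorted(range(len(years)), key=lambda i: (years[i], i))
print(order)    # [2, 4, 0, 1, 3]

# With equal values, the earlier row comes first
tied = [2009, 1985, 2009]
tied_order = sorted(range(len(tied)), key=lambda i: (tied[i], i))
print(tied_order)    # [1, 0, 2]
```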

Your score will be calculated using [Kendall's tau](https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient), a correlation measure for ordinal data.

*This function should return a Series of length 500 and dtype int.*

In [None]:
import pandas as pd

doc = []
with open('assets/dates.txt') as file:
    for line in file:
        doc.append(line)

df = pd.Series(doc)
df.head(10)

In [None]:
def date_sorter():
    import re

    # One alternative per date format. The outer group of each alternative
    # captures the whole date; the inner month groups are by-products.
    reg = r"""
    (?:[a-z\s\(.~]|^)(\d{4})|
    (\d{1,2}/\d{4})|
    (\d{1,2}/\d{1,2}/\d{2,4})|
    ((Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*[,.]?\s\d{1,2}[,]?\s\d{4})|
    ((Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*[,.]?\s\d{4})|
    (\d{1,2}\s(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\s\d{4})|
    (\d{1,2}-\d{1,2}-\d{2,4})"""
    order = df.str.extract(reg, flags=re.VERBOSE)
    # Keep the first non-null capture in each row (same result as chaining fillna)
    order = order.bfill(axis=1)[0]

    # Normalize separators and drop punctuation
    order = order.str.replace("-", "/", regex=False).str.strip()
    order = order.str.replace(r"[,.]", "", regex=True)
    # Bare year -> 01/01/yyyy
    order = order.str.replace(r"(?P<year>^\d{2,4}$)", r"01/01/\g<year>", regex=True)
    # Two-digit year -> 19yy
    order = order.str.replace(r"(?P<date>\d{1,2}/\d{1,2}/)(?P<year>\d{2}$)",
                              r"\g<date>19\g<year>", regex=True)
    # Full month name -> three-letter abbreviation
    order = order.str.replace(r"(?P<month>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*", r"\g<month>", regex=True)
    # "20 Mar 2009" -> "Mar 20 2009"
    order = order.str.replace(r"(?P<day>\d{1,2})\s(?P<month>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s(?P<year>\d{4})", r"\g<month> \g<day> \g<year>", regex=True)
    # "Mar 2009" -> "Mar 01 2009"
    order = order.str.replace(r"^(?P<month>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s(?P<year>\d{4})", r"\g<month> 01 \g<year>", regex=True)
    # "6/2008" -> "6/01/2008"
    order = order.str.replace(r"^(?P<month>\d{1,2})/(?P<year>\d{4})$", r"\g<month>/01/\g<year>", regex=True)

    # Parse each value on its own so that the two remaining formats are both accepted
    dates = order.apply(lambda s: pd.to_datetime(s, dayfirst=False))
    # A stable sort keeps the original row order for equal dates
    idx = dates.argsort(kind='mergesort')

    return idx
