# CS481: Intelligence Text Analysis and Knowledge Mining

# About me
- Zhao Wang <br>
- Research: 
     - Natural Language Processing <br>
     > Estimating the effect of word choice on audience perception <br>
     - Machine Learning <br>
     > Spurious correlations <br>
     > Improve text classifier robustness <br>
     - Social Media Analysis <br>
     > Deceptive public messaging <br>
     - Online communication platforms:
     > Twitter, Yelp, Airbnb, IMDB, League of Legends

# About the course
- Practical introduction to NLP
- Learn by example and real-world tasks
- A balance between theoretical and practical sides
- Linguistic analysis and computational methods for NLP


## Github repository
- https://github.com/iit-cs481/main

## Schedule

## Exams and quizzes

## Software and packages
- installing **anaconda** is recommended
- make sure you use **python3** and **nltk3**


### Why Python?
- Python:
    - Interpreted, readable programming language <br>
    - Has **a huge set of libraries that make it a good fit for AI, data science**, etc. <br>
    - Simple and powerful for processing language data <br>
    - Short code, often 3-4 times shorter than the equivalent Java <br>
    - Shallow learning curve <br>
    - Platform independent <br>
    - Success stories in real-world applications: https://www.python.org/about/success/

- Java:
    - Has library support for graphical interfaces <br>
        - Software development <br>
        - Mobile apps <br>
    - Code tends to be verbose <br>
    - Platform / OS independent <br>

- C++:
    - A fast compiled programming language <br>
        - suits tasks with strict time requirements <br>
        - connects closely to hardware, e.g., in EE <br>
    - Has more limited library support <br>
    - Code is typically shorter than Java <br>
    - Platform / OS dependent <br>

### Why NLTK?
- http://nltk.org/
- Practical knowledge of NLP
- Data corpora and lexical resources

- APIs and documentation:
> defines standard interfaces for performing NLP tasks (e.g., POS tagging, syntactic parsing) <br>
> covers every module, class, and function <br>
> specifies parameters and shows examples of usage <br>

- created in 2001
> as part of a computational linguistics course in the Department of Computer and Information Science at the University of Pennsylvania <br>
> first version released in February 2005 <br>


In [76]:
import nltk

# check nltk version is 3
nltk.__version__ 

'3.3'

In [None]:
# from a Python session (not the shell): nltk.download()
# nltk.data.path.append("/data/3/zwang/nltk_data") # add your nltk_data directory to NLTK's search path

- nltk_data installation: https://www.nltk.org/data.html
- Be careful with environment variables (e.g., `NLTK_DATA`) that tell NLTK where to look for data

In [86]:
# test if the nltk_data is successfully installed
from nltk.corpus import brown
from nltk.book import *

In [87]:
text1

<Text: Moby Dick by Herman Melville 1851>

In [81]:
print(text1[:50])

['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', '1851', ']', 'ETYMOLOGY', '.', '(', 'Supplied', 'by', 'a', 'Late', 'Consumptive', 'Usher', 'to', 'a', 'Grammar', 'School', ')', 'The', 'pale', 'Usher', '--', 'threadbare', 'in', 'coat', ',', 'heart', ',', 'body', ',', 'and', 'brain', ';', 'I', 'see', 'him', 'now', '.', 'He', 'was', 'ever', 'dusting', 'his', 'old', 'lexicons', 'and']


# What is Natural Language? 
- Natural language
> Emerges from intelligent beings <br>
> For human communication <br>
> Full of ambiguity (e.g., meet me at the bank) <br>
> Hard to specify with explicit rules <br>
> We **discover** the grammar. <br>
> English, Spanish, Chinese, Dolphin Language? <br>

- Unnatural / Artificial / Formal language
> Defined by humans <br>
> We prescribe the grammar <br>
> Designed to remove ambiguity <br>
> Programming language like Python, Java, C++... <br>
> Mathematical notations ... <br>

## What is Natural Language Processing (NLP)?
- NLP is the set of methods for making natural **language** accessible to **computers**. -- by Jacob Eisenstein
- A branch of AI that deals with the interaction between **computers** and humans using natural **language**.


- Computational Linguistics
    - Apply computational methods to make computers understand natural language <br>
    - Model the structure and meaning of language in a way that computers can understand <br>
    - Counting word frequencies in different docs to compare their writing styles <br>
        - Twitter: short, informal, emoji, abbreviations, ... <br>
        - News reports: long, formal, time, location, people, ...<br>
        - Theses: long, formal, definitions, notations, inference, proofs, ...<br>
    - Understanding human language <br>
        - Information extraction <br>
        - Text summarization <br>
        - Language generation <br>
        - Communicate with computers <br>
        - Build intelligent machines to understand natural language<br>
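
As a minimal sketch of the word-frequency idea above, the snippet below counts tokens in two short texts and prints their most common words. The sample strings and the `word_frequencies` helper are made up for illustration, and the regex tokenizer is a simplifying assumption; a real comparison would use full documents and a proper tokenizer (for example, one from NLTK).

```python
import re
from collections import Counter

def word_frequencies(text):
    """Lowercase the text, split it into word tokens, and count them."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Hypothetical sample texts standing in for a tweet and a news sentence.
tweet = "omg this game is so good lol so so good"
news = "The city council approved the budget on Monday, the mayor said."

tweet_freq = word_frequencies(tweet)
news_freq = word_frequencies(news)

# The most frequent words hint at the style of each text.
print(tweet_freq.most_common(2))
print(news_freq.most_common(1))
```

Even on toy inputs, the informal text is dominated by intensifiers while the news sentence is dominated by function words, which is the kind of signal a style comparison builds on.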


- Interdisciplinary: 
    - Artificial Intelligence
    - Machine learning, statistical models, neural networks
    - Linguistics, representations of language
    - Social science, data and application in social issues
    - Mathematics, Statistics, Psychology, etc... <br>


- NLP+X
    - computational social science
    - computational humanities
    - ...

# The big picture of NLP

## Real world applications 

### Handwriting recognition
- character recognition
- misspelled words
- sentence plausibility
<img src="./handwriting.png" width="200">

### Grammar checking (Grammarly)
- singular vs plural
- language model
<img src="./grammarly.png" width="800">

### Search engines and predictive text
- The most probable strings
- Information extraction
- The most relevant answer to the query
    
<img src="./predictive_text.png" width="800">
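
One simple way to pick "the most probable string" for predictive text is a bigram count model: suggest the word that most often followed the current word in some training text. The sketch below assumes a tiny invented corpus and hypothetical helper names (`train_bigrams`, `predict_next`); real systems use far larger corpora and smoothed language models.

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count, for each word, which words follow it in the training sentences."""
    following = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for current, nxt in zip(tokens, tokens[1:]):
            following[current][nxt] += 1
    return following

def predict_next(following, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    candidates = following.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]

# Toy training data, invented for illustration.
corpus = [
    "how to learn python",
    "how to learn nlp",
    "how to learn python fast",
]
model = train_bigrams(corpus)
print(predict_next(model, "learn"))
```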

### Product recommendation 
- Items with the most similar attributes and contents
   

<img src="./search_engine.png" width="800">
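
To make "most similar attributes" concrete, one common choice is Jaccard similarity between attribute sets. This is only a sketch under that assumption: the catalog, the query, and the helper names are hypothetical, and production recommenders typically combine many more signals.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def most_similar(target, catalog):
    """Return the name of the catalog item whose attributes best match `target`."""
    return max(catalog, key=lambda name: jaccard(target, catalog[name]))

# Hypothetical product catalog: item name -> set of attribute keywords.
catalog = {
    "trail shoes": {"shoes", "outdoor", "running", "waterproof"},
    "office chair": {"furniture", "ergonomic", "indoor"},
    "rain jacket": {"clothing", "outdoor", "waterproof"},
}
query = {"shoes", "running", "lightweight"}
print(most_similar(query, catalog))
```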

### Machine Translation

- difficult because:
    - a given word could have several possible translations (depending on its meaning)
    - word order could change in keeping with the grammatical structure of the target language

- solutions:
    - collect massive quantities of parallel texts in two or more languages
    - align the texts to automatically pair up sentences
    - detect corresponding words and phrases
    - train translation models to translate new texts
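
The "detect corresponding words" step can be illustrated with a very small co-occurrence sketch: given sentence pairs that are already aligned, score each source/target word pair by the Dice coefficient and pick the best-scoring target word. The parallel sentences and function names below are invented for illustration, and Dice scoring is just one simple choice, not how modern translation models are trained.

```python
from collections import Counter
from itertools import product

def dice_scores(pairs):
    """Score source/target word pairs by the Dice coefficient over aligned sentences."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        joint.update(product(src_words, tgt_words))
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in joint.items()
    }

def best_translation(scores, word):
    """Pick the target word with the highest Dice score for `word`, if any."""
    candidates = {t: v for (s, t), v in scores.items() if s == word}
    return max(candidates, key=candidates.get) if candidates else None

# Toy sentence-aligned English/French pairs (hypothetical data).
parallel = [
    ("the house", "la maison"),
    ("the car", "la voiture"),
    ("a house", "une maison"),
]
scores = dice_scores(parallel)
print(best_translation(scores, "house"))
```

Words that always appear together across sentence pairs get the highest score, which is why more parallel text generally gives more reliable correspondences.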

<img src="./machine_translation.png" width="800">


### User attribute analysis in tweets and blogs (e.g., gender, age, sentiment)
- text preprocessing (cleaning, removing emojis, expanding abbreviations)
- attribute keywords
- text classification
    
<img src="./user_attri.png" width="600">
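
A rough sketch of the preprocessing step is shown below. It assumes a tiny hypothetical abbreviation table and uses a crude trick (dropping all non-ASCII characters) to remove emojis, which would also strip accented letters; treat it as an illustration rather than a recommended cleaning pipeline.

```python
import re

# Hypothetical abbreviation table; a real one would be much larger.
ABBREVIATIONS = {"u": "you", "gr8": "great", "thx": "thanks"}

def preprocess(tweet):
    """Lowercase, drop URLs, @mentions and non-ASCII symbols, then expand abbreviations."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)               # remove URLs
    text = re.sub(r"@\w+", " ", text)                       # remove @mentions
    text = text.encode("ascii", "ignore").decode("ascii")   # crude emoji removal
    tokens = re.findall(r"[a-z0-9']+", text)
    return [ABBREVIATIONS.get(tok, tok) for tok in tokens]

# "\U0001F600" is the grinning-face emoji.
print(preprocess("@sam U look gr8 today \U0001F600 https://example.com"))
```

The cleaned token list can then be matched against attribute keywords or fed to a text classifier.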



### Chat-bot
- question answering system
- understand human language in specific context
- generate "natural language" to answer user's questions

<img src="./chat_bot.png" width="300">

### Other domains:
- scientific, economic, social, cultural
- Industry: HCI, information analysis, machine translation, information extraction
- Academia: computational linguistics, computer science, artificial intelligence, machine learning
- Job search

## Why is NLP hard? 
- Language is a complex social process
- Tremendous ambiguity at every level

### Word sense disambiguation

- Meet me at the **bank**.

    - bank: the organization that provides financial services <br>
    - bank: the side of a river

- Which sense of a word was intended in a given context?
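
One classic family of approaches picks the sense whose dictionary definition overlaps most with the surrounding words (a simplified Lesk-style heuristic). The sketch below uses hand-written glosses and a small stopword list, both invented for illustration rather than taken from a real lexicon such as WordNet.

```python
import re

# Hand-written glosses for two senses of "bank" (illustrative, not from a real lexicon).
SENSES = {
    "financial institution": "an organization that provides financial services such as loans and deposits of money",
    "river side": "the sloping land along the side of a river or lake",
}

STOPWORDS = {"a", "an", "the", "of", "or", "and", "as", "at", "to", "that", "such", "me", "i"}

def content_words(text):
    """Return the set of lowercase word tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def simplified_lesk(context, senses=SENSES):
    """Choose the sense whose gloss shares the most content words with the context."""
    ctx = content_words(context)
    return max(senses, key=lambda s: len(ctx & content_words(senses[s])))

print(simplified_lesk("I need to deposit money at the bank"))
print(simplified_lesk("We fished from the bank of the river"))
```

Note that "Meet me at the bank" shares no content words with either gloss, so this heuristic cannot separate the senses and simply falls back to the first one listed; the sentence stays ambiguous without more context.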

### Pronoun Resolution

1. The thieves stole the paintings. **They** were subsequently sold.

2. The thieves stole the paintings. **They** were subsequently caught.

3. The thieves stole the paintings. **They** were subsequently found.


- Who did what to whom?
    - identifying what a pronoun or noun phrase refers to
    - semantic role labeling: identifying how a noun phrase relates to the verb (as agent, patient, instrument, and so on).

### Spoken Dialog Systems
- commonly assumed pipeline for NLP
    - analyze spoken input
    - recognize words
    - parse sentences
    - interpret them in context
    - perform application-specific actions (e.g., generating an answer to a question, translating into another language)
    - generate a response on the computer's side
    - realize the response as a syntactic structure
    - express it with suitable words
    - generate spoken output <br>

- different types of linguistic knowledge inform each stage of the process:
    - phonology
    - morphology
    - syntax
    - semantics
    - reasoning

<img src="./dialog.png" width="600">

# Questions for you
How many of you
- ... are CS majors?
- ... are graduate vs. undergraduate students?
- ... are familiar with Python?
- ... have taken probability and statistics?
- ... have taken machine learning / data mining / social media analysis?

**Course survey**:

- https://forms.gle/nukxxPCfzD5P6cDD7