# Introduction to Natural Language Processing (NLP)

## Definition of NLP

Natural Language Processing (NLP) is a subfield of artificial intelligence (AI), linguistics, and cognitive science that focuses on the interaction between computers and human languages.

It aims to develop algorithms and models that enable computers to understand, interpret, and generate human language in a way that is both meaningful and useful. 

## Applications of NLP

Some common applications of NLP are:

- Part-of-speech (POS) tagging: Assigning a grammatical category (e.g., noun, verb, adjective) to each word in a text.

- Sentiment analysis: Determining the sentiment or emotion behind a piece of text (e.g., positive, negative, or neutral).

- Text classification: Categorizing text into predefined classes (e.g., spam detection, topic classification).

- Machine translation: Automatically translating text from one language to another.

- Information retrieval: Searching and retrieving relevant information from large datasets (e.g., search engines).

- Named entity recognition: Identifying and classifying named entities in a text, such as proper names, organizations, and dates.

- Text summarization: Generating a concise summary of a longer text while preserving the main ideas.

- Question answering: Providing accurate answers to questions posed by users.

- Dialogue systems: Building chatbots or virtual assistants that can engage in conversation with users.

- Speech recognition: Converting spoken language into written text.
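
As a concrete illustration of the text classification task listed above (spam detection), here is a minimal sketch of a multinomial Naive Bayes classifier in plain Python. The choice of Naive Bayes, the whitespace tokenizer, the four training sentences, and the names `train_naive_bayes` and `predict` are all assumptions made for this example; they are not taken from these notes.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def train_naive_bayes(examples):
    """Count documents per class and words per class from (text, label) pairs."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_docs[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_docs, word_counts, vocab


def predict(text, model):
    """Return the label with the highest log-probability under the model."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # skip words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, invented for illustration only.
TRAIN = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]

model = train_naive_bayes(TRAIN)
print(predict("free prize money", model))        # spam
print(predict("team meeting on monday", model))  # ham
```

With so little data the classifier only memorizes word associations; real systems train on thousands of labeled examples and use a more careful tokenizer.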

<img src='https://static.packt-cdn.com/products/9781789536089/graphics/assets/0bd4c08a-3dbf-4d48-9fb8-d7800c486b63.png' />

- Sequential data are not limited to text: time series, DNA sequences, weather records, audio, and video are sequential as well.

- Sequence-oriented tasks have **variable-length inputs, variable-length outputs, and variable-length computation**, and the model must remember the history of its input.
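
One common way to handle variable-length inputs in practice is to pad every sequence in a batch to the same length and keep a mask that records which positions are real. The sketch below shows the idea in plain Python; the helper name `pad_batch`, the padding id `0`, and the token ids are assumptions for illustration, not something these notes prescribe.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to a common length and build a matching 0/1 mask."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that a model should ignore.
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


# Three "sentences" of different lengths, already mapped to made-up token ids.
batch = [[5, 8, 2], [7], [3, 9, 4, 1, 6]]
padded, mask = pad_batch(batch)

for row, m in zip(padded, mask):
    print(row, m)
```

The mask matters because a padded position carries no information; models that process batches typically use it to exclude padding from attention or loss computations.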

## Challenges in NLP

Ambiguity at various levels: lexical, syntactic, semantic, and pragmatic (context-dependent).

Morphological challenges: typos and inconsistencies in how compound words are interpreted (junior college, college junior, pet spray, pet llama).

Lexical challenges (noisy text): slang, idioms, metaphors, novel words (selfie, chillax), other non-literal expressions, and non-standard tokens (A360, 7342.67).

Syntactic challenges: complex sentences (multiple clauses or components) and the parsing of compound words or multi-word expressions.

Semantic challenges: subjectivity (opinion rather than fact), negation, and counterfactual or hypothetical statements.

Pragmatic challenges: humor and sarcasm, implicature, inference, world knowledge, and differentiating between literal and intended meanings.

Evolving language: Languages constantly evolve, with new words and phrases emerging regularly.

Intrinsic complexity of human languages: language is challenging even for human learners, in both first and second languages.

Multilinguality: Developing NLP systems that work well across multiple languages.
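
To make the negation challenge listed above concrete, the following sketch compares a naive word-counting sentiment scorer with a variant that flips polarity after a negator. The tiny word lists and the "flip the next sentiment word" heuristic are assumptions chosen for illustration; they are far too crude for real use and do nothing for sarcasm or implicature.

```python
# Tiny hand-made lexicons, invented for this example.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "boring", "hate"}
NEGATORS = {"not", "never", "no"}


def naive_score(text):
    """Count positive words minus negative words, ignoring context."""
    tokens = text.lower().split()
    return sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)


def negation_aware_score(text):
    """Like naive_score, but a negator flips the next sentiment word."""
    score = 0
    flip = False
    for t in text.lower().split():
        if t in NEGATORS:
            flip = True
            continue
        polarity = (t in POSITIVE) - (t in NEGATIVE)
        if polarity != 0:
            score += -polarity if flip else polarity
            flip = False  # the negator has been consumed
    return score


sentence = "the movie was not good"
print(naive_score(sentence))           # 1  (wrongly positive)
print(negation_aware_score(sentence))  # -1
```

Even the negation-aware version scores a sarcastic sentence such as "great another boring meeting" purely by its surface words, which is why pragmatic phenomena remain hard.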

## History of NLP



1. **Foundational insights (1940s-1950s)**: Early work by Turing (automaton), Shannon (information theory, noisy channel), and Backus & Naur (formal languages) laid the groundwork for NLP.

2. **Two camps (1957-1970)**: Symbolic (Chomsky's transformational grammar, artificial intelligence) and stochastic (Bayesian reasoning, corpus work by Kučera & Francis) approaches emerged.

3. **Four paradigms (1970-1983)**: Stochastic (IBM), logic-based (Colmerauer et al.), natural language understanding (Winograd, Schank, Fillmore), and discourse modeling (Grosz & Sidner).

4. **Empiricism and finite-state models redux (1983-1993)**: Revival of empirical and finite-state models (Kaplan & Kay for phonology and morphology, Church for syntax).

5. **Late years (1994-2010)**: Integration of techniques, expansion to speech and information retrieval (IR), probabilistic models, machine learning, structured prediction, and topic models.

6. **Recent years (2010-)**: Machine learning (ML) and deep learning (DL) techniques

    **Machine learning methods**: Support Vector Machines (SVM), logistic regression (maxent), Conditional Random Fields (CRFs)

    **Shared tasks**: TREC (Text Retrieval Conference), DUC (Document Understanding Conference), TAC (Text Analysis Conference), \*SEM (Joint Conference on Lexical and Computational Semantics)

    **Semantic tasks**: Recognizing Textual Entailment (RTE), Semantic Role Labeling (SRL).

    **Semi-supervised and unsupervised methods**: Zero-shot learning, transfer learning, self-training.

    **Deep Learning**: Embeddings, Long Short-Term Memory (LSTM), Convolutional Neural Networks (CNN), Attention, Generative Adversarial Networks (GANs), Transformers, BERT.

## Benchmark Datasets

| Name                | Full Name                                      | Task                        | Short Description                                                   |
|---------------------|------------------------------------------------|-----------------------------|---------------------------------------------------------------------|
| Penn Treebank (PTB) | Penn Treebank                                  | Syntactic parsing, POS tagging, Language modeling | Annotated corpus of English text for evaluating syntactic parsing and other tasks |
| CoNLL-2003          | Conference on Computational Natural Language Learning 2003 | Named Entity Recognition    | Dataset for identifying and classifying named entities in text      |
| TREC Question Classification | Text REtrieval Conference Question Classification | Question Classification | Dataset for categorizing questions into different classes          |
| SQuAD               | Stanford Question Answering Dataset           | Reading Comprehension       | Dataset for answering questions based on a given context            |
| IMDb                | Internet Movie Database                       | Sentiment Analysis          | Dataset of movie reviews labeled as either positive or negative     |
| SNLI                | Stanford Natural Language Inference           | Natural Language Inference  | Dataset for determining the relationship between two sentences      |
| MS MARCO            | Microsoft MAchine Reading COmprehension       | Machine Reading Comprehension | Large-scale dataset with real user queries and human-generated answers |
| SemEval             | Semantic Evaluation                           | Various Semantic Analysis Tasks | Annual shared tasks evaluating models on different semantic analysis tasks |
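
Benchmarks like those in the table are only useful together with a scoring rule. As an illustration, reading comprehension datasets such as SQuAD are commonly scored with exact match and token-level F1 between a predicted answer and a reference answer. The sketch below is a simplified version under stated assumptions: it normalizes only by lowercasing and whitespace splitting, whereas official evaluation scripts usually apply further normalization, and the function names and example answers are invented here.

```python
from collections import Counter


def exact_match(prediction, reference):
    """True when the two answers are identical after trivial normalization."""
    return prediction.strip().lower() == reference.strip().lower()


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall between two answers."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


prediction = "the Eiffel Tower"
reference = "Eiffel Tower"
print(exact_match(prediction, reference))         # False
print(round(token_f1(prediction, reference), 2))  # 0.8
```

Token-level F1 gives partial credit for the extra word "the", while exact match gives none, which is why the two numbers are usually reported side by side.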


## Genres of NLP Text

Electronic Health Records (EHR)

Online Repositories: Project Gutenberg, arXiv, JSTOR, PubMed

Web Data: Web scraping, News articles, Blogs and personal websites

Social Media Platforms: Twitter, Facebook, Reddit, Instagram

Online Forums and Discussion Boards: Stack Exchange, Quora

Published Works: Books, Research papers, Scholarly articles

Corporate Documents: Company reports, Press releases, Internal documents

Government and Legal Texts

Conversational Data: Chat logs, Interview transcripts, Speech transcripts

Customer Reviews and Feedback: Amazon reviews, Yelp reviews, TripAdvisor reviews

Email and Messaging Data

OCR-generated Text


## Research in NLP

NLP (Natural Language Processing):

- ACL (Association for Computational Linguistics)

- EMNLP (Conference on Empirical Methods in Natural Language Processing)

- NAACL (North American Chapter of the Association for Computational Linguistics)

ML (Machine Learning):

- NeurIPS (Conference on Neural Information Processing Systems)

- ICML (International Conference on Machine Learning)

- ICLR (International Conference on Learning Representations)

DL (Deep Learning): the same conferences as ML and CV, such as NeurIPS, ICML, ICLR, CVPR, and ECCV.

RL (Reinforcement Learning):

- AAMAS (International Conference on Autonomous Agents and Multiagent Systems)

- ICRA (International Conference on Robotics and Automation)

- IROS (International Conference on Intelligent Robots and Systems)

CV (Computer Vision):

- CVPR (Conference on Computer Vision and Pattern Recognition)

- ICCV (International Conference on Computer Vision)

- ECCV (European Conference on Computer Vision)

## Visualization Tools

**Text Classification and Sentiment Analysis:**
- [AllenAI Demos](https://allenai.org/demos/)

**Named Entity Recognition (NER):**
- [AllenNLP Named Entity Recognition Demo](https://demo.allennlp.org/named-entity-recognition)

**Reading Comprehension:**
- [AllenNLP Reading Comprehension Demo](https://demo.allennlp.org/reading-comprehension)

**Parsing and Syntax Analysis:**
- [CKY Parsing Demo](http://lxmls.it.pt/2015/cky.html)

**Text Generation and Language Modeling:**
- [TensorFlow Playground](https://playground.tensorflow.org/)
- [Interactive Machine Learning List](https://p.migdal.pl/interactive-machine-learning-list/)

**Text Similarity and Sequence Alignment:**
- [Levenshtein Distance Demo](http://www.let.rug.nl/~kleiweg/lev/)
- [Biological Sequence Alignment (Pairwise Align DNA)](http://www.bioinformatics.org/sms2/pairwise_align_dna.html)
- [Animal Genome Sequence Formats](http://www.animalgenome.org/bioinfo/resources/manuals/seqformats)
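
For readers who want to see what the Levenshtein demo linked above computes, here is a minimal sketch of the standard dynamic-programming edit distance with unit costs for insertion, deletion, and substitution. The function name `levenshtein` and the example strings are chosen for illustration and are not taken from any of the linked tools.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits that turn string a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(
                min(
                    prev[j] + 1,         # delete ca
                    curr[j - 1] + 1,     # insert cb
                    prev[j - 1] + cost,  # substitute (or match for free)
                )
            )
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Keeping only the previous row reduces memory to the length of the second string. The same recurrence underlies pairwise sequence alignment, although alignment tools typically use different costs.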

**Neural Network Visualization:**
- [Neural Network Manifolds and Topology](http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/)
- [ConvNetJS](https://cs.stanford.edu/people/karpathy/convnetjs/)
- [ConvNetJS MNIST Demo](https://cs.stanford.edu/people/karpathy/convnetjs/demo/mnist.html)

**Transformer Model Visualization:**
- [BERTViz](https://github.com/jessevig/bertviz)
- [BERT Visualization Colab 1](https://colab.research.google.com/drive/1c73DtKNdl66B0_HF7QXuPenraDp0jHRS)
- [BERT Visualization Colab 2](https://colab.research.google.com/drive/1PEHWRHrvxQvYr9NFRC-E_fr3xDq1htCj#scrollTo=fZAXH7hWyt58)

**Miscellaneous:**
- [Explosion AI Demos (spaCy and Prodigy)](https://explosion.ai/demos/)
- [Stanford CoreNLP](https://corenlp.run/)
