---   
<img align="left" width="110"   src="https://upload.wikimedia.org/wikipedia/commons/c/c3/Python-logo-notext.svg"> 
<h1 align="center">Tools and Techniques for Data Science</h1>
<h1 align="center">Course: Natural Language Processing</h1>

--- 
<h2><div align="right">Muhammad Sheraz (Data Scientist)</div></h2>
<h1 align="center">Lecture 2 (Natural Language Processing Pipeline)</h1>

<img align="center" width="900"  src="images/nlp-overview.png"  > 

# Learning Agenda

1. **NLP Pipeline**
    - Data Acquisition
    - Text Preprocessing
    - Feature Engineering
    - Model Building
    - Model Evaluation
    - Deployment

# NLP Pipeline
<img align="center" width="900"  src="images/nlp-pipeline.png"  > 

# a. Data Acquisition

<img align="right" width="600" src="images/data-acquisition.png"  >

- **Use Libraries Built-in Datasets:**
    - Seaborn: (iris, titanic, tips, flights, penguins, car_crashes)
    - Scikit-learn: (iris, digits, diabetes, Boston housing (removed in scikit-learn 1.2))
    - NLTK: (movie_reviews, product_reviews, twitter_samples, gutenberg, genesis, timit, voice, wordnet, sentiwordnet)
    
- **Use Public Dataset Repositories:**
    - https://www.kaggle.com/
    - https://data.gov/
    - https://archive.ics.uci.edu/ml/index.php
    - https://github.com/
- **Use the Company's Database:** (SQL, NoSQL, Data warehouse, Data lake)
- **Generate your own Datasets:**
    - Use Web scraping or a Web API
    - IoT Devices
    - Crowdsourcing (Amazon Mechanical Turk, Lionbridge AI)
    - Data Augmentation
    

<h3 align="center"><div class="alert alert-success" style="margin: 20px">Better data often beats better algorithms</div></h3>

# b. Text Pre-Processing

<img align="right" width="600" src="images/text-preprocessing.png"  >

- **Text Cleaning:** 
    - Removing digits and words containing digits
    - Removing newline characters and extra spaces
    - Removing HTML tags
    - Removing URLs
    - Removing punctuation
    - Handling emojis
    - Spelling correction

- **Basic Preprocessing:** 
    - Case folding
    - Expand contractions
    - Chat word treatment
    - Spelling correction
    - Tokenization and N-grams
    - Removing stop words

- **Advanced Preprocessing:**
    - Stemming
    - Lemmatization
    - POS tagging
    - Parsing
    - Coreference Resolution

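The cleaning and basic preprocessing steps listed above can be sketched with the Python standard library alone. This is a minimal illustration, not a production pipeline: the function names (`clean_text`, `preprocess`, `ngrams`), the tiny lookup tables, and the sample sentence are all assumptions made for this example, and emoji handling, spelling correction, and the advanced steps are left out.

```python
import html
import re
import string

# Tiny illustrative lookup tables (assumptions); real projects use much larger ones.
CONTRACTIONS = {"don't": "do not", "it's": "it is", "can't": "cannot"}
CHAT_WORDS = {"imo": "in my opinion", "asap": "as soon as possible"}
STOP_WORDS = {"a", "an", "the", "is", "it", "in", "as", "to", "do", "my", "for"}


def clean_text(text):
    """Apply a few of the text-cleaning steps listed above."""
    text = re.sub(r"<[^>]+>", " ", text)                # remove HTML tags
    text = html.unescape(text)                          # decode entities such as &amp;
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"\w*\d\w*", " ", text)               # remove digits and words containing digits
    return re.sub(r"\s+", " ", text).strip()            # collapse newlines and extra spaces


def preprocess(text):
    """Case folding, contraction and chat-word expansion, tokenization, stop-word removal."""
    words = []
    for word in clean_text(text).casefold().split():
        word = word.strip(string.punctuation)           # strip punctuation around each word
        word = CONTRACTIONS.get(word, word)
        word = CHAT_WORDS.get(word, word)
        words.extend(word.split())                      # an expansion may yield several tokens
    return [w for w in words if w not in STOP_WORDS]


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


raw = "<p>IMO it's   GREAT!\nVisit https://example.com for 100 tips &amp; tricks.</p>"
tokens = preprocess(raw)
print(tokens)             # ['opinion', 'great', 'visit', 'tips', 'tricks']
print(ngrams(tokens, 2))  # bigrams such as ('opinion', 'great')
```

The order of the steps matters: tags and URLs are removed before punctuation is stripped, otherwise the patterns that identify them would no longer match.
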
# c. Feature Engineering
<img align="center" width="900"  src="images/feature-engr-models.png"  > 

- Once text pre-processing is done, the text must be transformed into features that can be used for modeling. We assign numeric weights to the words in each document so that the text can be fed to a machine learning or deep learning algorithm. Ideally, these weights are assigned so that the numbers reflect the meaning of each word.

- The process of converting text data into vectors of real numbers is called `Feature Extraction from text`, `Text Representation`, or `Text Vectorization`. Its goal is to produce numbers that capture the semantics (meaning) of the words.

- Word embedding techniques fall into two main categories:
    - **Frequency Based Word Embedding Techniques:**
        - Bag of Words (BoW)
        - Term Frequency - Inverse Document Frequency (TF-IDF)
        - Global Vectors (GloVe)
    - **Prediction Based Word Embedding Techniques:**
        - Word2Vec (Google, 2013)
        - FastText (Facebook, 2016)

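To make the frequency-based techniques concrete, here is a minimal sketch of Bag of Words and TF-IDF written from scratch. The three toy documents and the function names are assumptions for illustration. The IDF used here is the plain `log(N / df)` form; library implementations typically use smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (an assumption for illustration only).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for doc in tokenized for word in doc})


def bag_of_words(tokens):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]


def idf(word):
    """Inverse document frequency: log(N / number of documents containing the word)."""
    df = sum(1 for doc in tokenized if word in doc)
    return math.log(len(tokenized) / df)


def tf_idf(tokens):
    """Term frequency (count / document length) weighted by IDF."""
    counts = Counter(tokens)
    return [counts[word] / len(tokens) * idf(word) for word in vocab]


bow_matrix = [bag_of_words(doc) for doc in tokenized]
tfidf_matrix = [tf_idf(doc) for doc in tokenized]

print(vocab)
print(bow_matrix[0])   # [0, 1, 0, 0, 0, 0, 1, 1, 1, 2]
print([round(x, 3) for x in tfidf_matrix[0]])
```

In the first document, "the" has the highest raw count, yet its TF-IDF weight is lower than that of "cat" because "the" also appears in another document. Neither representation captures word order or meaning, which is what the prediction-based techniques aim to improve.
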
# d. Model Building

<img align="center" width="800"  src="images/dl-approach.png"  > 

- Machine learning approaches can be supervised or unsupervised, and can address classification or regression tasks.
    - Supervised ML Algorithms used in NLP:
        - Naïve Bayes
        - Logistic Regression
        - Support Vector Machine
        - Decision Trees
        - Random Forest
    - Unsupervised ML Algorithms used in NLP:
        - K-Means Clustering
        - Latent Semantic Indexing
        - Latent Dirichlet Allocation (LDA)
        - Non-negative Matrix Factorization
        - Hidden Markov Model (HMMs can be trained both in an unsupervised and in a supervised fashion)


- Deep Learning models
    - CNN (Convolutional Neural Networks)
    - RNN (Recurrent Neural Networks)
    - LSTM (Long Short-Term Memory)
    - GRU (Gated Recurrent Unit)
    - Transformers
        - BERT (Bidirectional Encoder Representations from Transformers)
        - GPT (Generative Pre-trained Transformer)

- If the dataset has few observations relative to the number of features, choose high bias/low variance algorithms such as linear regression, Naïve Bayes, or linear SVM.
- If the training data is sufficiently large and the number of observations is high compared to the number of features, one can go for low bias/high variance algorithms such as KNN, decision trees, or kernel SVM.

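As an illustration of one of the supervised algorithms listed above, the following is a minimal from-scratch sketch of a multinomial Naïve Bayes text classifier with add-one (Laplace) smoothing. The four-sentence training set, the labels, and the function names are made up for this example; in practice one would normally use a library implementation and far more data.

```python
import math
from collections import Counter, defaultdict

# Tiny made-up training set (an assumption for illustration only).
train = [
    ("great movie loved it", "pos"),
    ("wonderful acting great plot", "pos"),
    ("terrible movie hated it", "neg"),
    ("boring plot terrible acting", "neg"),
]

class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)          # word_counts[label][word] -> count
for text, label in train:
    word_counts[label].update(text.split())
vocab = {word for counts in word_counts.values() for word in counts}


def log_posterior(text, label):
    """Unnormalized log P(label | text) with add-one (Laplace) smoothing."""
    total_words = sum(word_counts[label].values())
    score = math.log(class_counts[label] / len(train))   # log prior
    for word in text.split():
        if word in vocab:                                 # ignore words never seen in training
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
    return score


def predict(text):
    """Return the label with the highest log posterior."""
    return max(class_counts, key=lambda label: log_posterior(text, label))


print(predict("great acting"))           # pos
print(predict("boring terrible movie"))  # neg
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing keeps a word that was seen in only one class from driving the other class's probability to zero.
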
# e. Model Evaluation

<img align="center" width="500"  src="images/model-evaluation.jpeg"  > 

- Model evaluation is the process of using different evaluation metrics to understand a machine learning model's performance, as well as its strengths and weaknesses.
- **Evaluation Metrics for Classification Algorithms:**
    - Confusion Matrix
    - Accuracy
    - Precision
    - Recall
    - F1-Score
    - AUC-ROC (Area Under the Receiver Operating Characteristic curve)
- **Evaluation Metrics for Regression Algorithms:**
    - Mean Absolute Error (MAE)
    - Mean Squared Error (MSE)
    - Root Mean Squared Error (RMSE)
    - R-Squared (coefficient of determination)
    - Adjusted R-Squared
>- For details read this blog: https://blog.knoldus.com/model-evaluation-metrics-for-machine-learning-algorithms/

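Most of the metrics listed above follow directly from their definitions. The sketch below computes them in plain Python for a binary classifier and for a regressor; the function names and the sample labels and values are assumptions for illustration. AUC-ROC and Adjusted R-Squared are omitted to keep the example short.

```python
import math


def classification_metrics(y_true, y_pred, positive=1):
    """Confusion-matrix counts plus accuracy, precision, recall and F1 (binary task)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "accuracy": (tp + tn) / len(pairs),
            "precision": precision, "recall": recall, "f1": f1}


def regression_metrics(y_true, y_pred):
    """MAE, MSE, RMSE and R-squared; assumes y_true is not constant."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    r2 = 1 - (mse * n) / ss_tot                          # 1 - SS_res / SS_tot
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse), "r2": r2}


clf = classification_metrics([1, 0, 1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 0, 1, 1, 0])
reg = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
print(clf)  # tp=3, tn=3, fp=1, fn=1, so accuracy, precision, recall and F1 are all 0.75
print(reg)  # mae=0.5, mse=0.5, r2 is approximately 0.9
```

Accuracy alone can be misleading on imbalanced data, which is why precision, recall, and F1 are usually reported alongside it.
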
# f. Model Deployment

<img align="center" width="800"  src="images/deployment.png"  > 
