<a href="https://colab.research.google.com/github/SoumyaCO/Computer-Vision-Fashion-Mnist/blob/main/Natural_Language_Processing_with_Tensorflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🗣️ Natural Language Processing with TensorFlow

Materials for this course:

* 🔗 [Course Link](https://www.coursera.org/learn/natural-language-processing-tensorflow/)
* 🌐 [Github Repository](https://github.com/https-deeplearning-ai/tensorflow-1-public/tree/main)
* 🤖 [Tensorflow Documentation](https://www.tensorflow.org/tutorials/text)

### Contents:
* Introduction
* Word based Encodings
* Texts to sequences
* Padding
* Sarcasm
* Assignment [BBC News Archive]

#### Introduction {#introdution}

In [11]:
# Code goes here

#### Word based Encodings

In [12]:
# Code Goes here
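
As a rough illustration of word-based encoding, the sketch below builds a word-to-integer dictionary in plain Python. It only mimics the general idea behind the Keras `Tokenizer` (frequent words get small numbers, index 0 is kept free for padding, index 1 is given to an out-of-vocabulary token); the helper name `build_word_index` and the toy sentences are assumptions made for this example, and punctuation handling is left out.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Map each word to an integer; more frequent words get smaller numbers."""
    counts = Counter()
    for text in texts:
        # Lowercase and split on whitespace (no punctuation filtering in this sketch)
        counts.update(text.lower().split())

    # Index 0 is left unused (reserved for padding), index 1 goes to the OOV token
    word_index = {oov_token: 1}
    for word, _ in counts.most_common():
        word_index[word] = len(word_index) + 1
    return word_index

# Hypothetical toy corpus
toy_sentences = ["I love my dog", "I love my cat"]
toy_word_index = build_word_index(toy_sentences)
print(toy_word_index)
```

Words that appear in both sentences ("i", "love", "my") end up with smaller indices than words that appear only once.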

#### Texts to Sequences

In [13]:
# Code Goes here
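
A minimal sketch of turning texts into sequences, assuming a word index is already available. The function `texts_to_sequences_sketch` and the small `demo_word_index` are made up for illustration; the sketch only imitates how unseen words fall back to the out-of-vocabulary index and is not the Keras implementation.

```python
def texts_to_sequences_sketch(texts, word_index, oov_token="<OOV>"):
    """Replace every word with its integer index; unknown words map to the OOV index."""
    oov_id = word_index[oov_token]
    return [
        [word_index.get(word, oov_id) for word in text.lower().split()]
        for text in texts
    ]

# Hypothetical word index for the example
demo_word_index = {"<OOV>": 1, "i": 2, "love": 3, "my": 4, "dog": 5, "cat": 6}

demo_sequences = texts_to_sequences_sketch(
    ["I love my dog", "I really love my hamster"], demo_word_index
)
print(demo_sequences)  # [[2, 3, 4, 5], [2, 1, 3, 4, 1]]
```

The words "really" and "hamster" are not in the word index, so both are encoded as 1.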

#### Padding

In [14]:
# Code goes here
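
The sketch below shows what post-padding does, in plain Python: every sequence is extended with zeros at the end until it is as long as the longest one. The helper `pad_post` is a hypothetical stand-in written for this example; it ignores truncation and the other options that `pad_sequences` offers.

```python
def pad_post(sequences, value=0):
    """Append `value` to each sequence until all have the length of the longest one."""
    maxlen = max(len(seq) for seq in sequences)
    return [list(seq) + [value] * (maxlen - len(seq)) for seq in sequences]

# Hypothetical integer sequences of different lengths
demo_padded = pad_post([[2, 3, 4, 5], [2, 1, 3, 4, 1], [6]])
for row in demo_padded:
    print(row)
```

After padding, all rows have the same length, so they can be stacked into a single 2D array.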

#### Sarcasm
**Data Info**:
This is a dataset from Kaggle.
It contains news article headlines with their links and a label indicating whether each headline is sarcastic.

**Goal**: Analyze the data and predict which headlines are sarcastic and which are not.
***

**🔑 Approach**
> * Download the data from 🔗 [here](https://storage.googleapis.com/tensorflow-1-public/course3/sarcasm.json)
> * Load the data from JSON format into a Python list of dictionaries
> * Store the **headlines**, **labels** and **links** in separate Python lists
> * Turn the headlines into padded sequences

In [15]:
# Download the dataset:
!wget https://storage.googleapis.com/tensorflow-1-public/course3/sarcasm.json

--2023-09-09 14:16:58--  https://storage.googleapis.com/tensorflow-1-public/course3/sarcasm.json
Resolving storage.googleapis.com (storage.googleapis.com)... 64.233.170.207, 142.251.175.207, 74.125.24.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|64.233.170.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5643545 (5.4M) [application/json]
Saving to: ‘sarcasm.json.1’


2023-09-09 14:17:00 (4.86 MB/s) - ‘sarcasm.json.1’ saved [5643545/5643545]



In [16]:
# Import json and load the file into a Python object:
import json

# Load json file
with open("./sarcasm.json", 'r') as f: # the r parameter is to open the file in read-only mode
    datastore = json.load(f)

In [17]:
# Non-sarcastic headline
print(f"Not Sarcastic:{datastore[0].get('headline')}")

# sarcastic headline
print(f"Sarcastic:{datastore[20000].get('headline')}")

Not Sarcastic:former versace store clerk sues over secret 'black code' for minority shoppers
Sarcastic:pediatricians announce 2011 newborns are ugliest babies in 30 years


***
Collect the headlines, labels and URLs. The URLs are not needed for this problem, but we store them anyway.
***

In [18]:
# initialize the lists
headlines = []
labels = []
# not needed for this problem, but we store the links as well
links = []

for item in datastore:
    headlines.append(item.get('headline'))
    labels.append(item.get('is_sarcastic'))
    # links:
    links.append(item.get('article_link'))

***
We'll now convert the list of sentences into padded sequences.
The code cell below generates a dictionary that maps every unique word to an integer index, and also a list of padded sequences, one for each of the 26,709 headlines.

In [19]:
# Importing necessary libraries
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Initializing tokenizer
tokenizer = Tokenizer(oov_token = "<OOV>")

# Generate the word_index dictionary
tokenizer.fit_on_texts(headlines)

# length of the word index
word_index = tokenizer.word_index
print(f"number of words in word index: {len(word_index)}")

# word index
# print(f"word index: {word_index}")
# print()

# getting sequences (list) from the headlines
sequences = tokenizer.texts_to_sequences(headlines)
padded = pad_sequences(sequences, padding='post')

# Print a sample headline
index = 2
print(f"sample headline: {headlines[index]}")
print(f"padded sequence: {padded[index]}")

# Print dimensions of the padded sequences
print(f"shape of the padded sequences: {padded.shape}")


number of words in word index: 29657
sample headline: mom starting to fear son's web series closest thing she will have to grandchild
padded sequence: [  145   838     2   907  1749  2093   582  4719   221   143    39    46
     2 10736     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0]
shape of the padded sequences: (26709, 40)
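
To check that a padded sequence still represents its headline, it can be decoded back into words. The sketch below does this for the sample output above. The partial dictionary `sample_word_index` is an assumption: it was reconstructed by lining up the words of the sample headline with the integers printed for it, and it is not the full 29,657-entry word index. In the notebook itself, the fitted tokenizer's `index_word` attribute could play the role of `index_to_word`.

```python
# Partial word index, inferred from the sample headline and its printed sequence
sample_word_index = {
    "to": 2, "will": 39, "have": 46, "she": 143, "mom": 145, "thing": 221,
    "series": 582, "starting": 838, "fear": 907, "son's": 1749, "web": 2093,
    "closest": 4719, "grandchild": 10736,
}

# Reverse mapping: integer index -> word
index_to_word = {index: word for word, index in sample_word_index.items()}

# The padded sequence printed above: 14 word indices followed by 26 zeros (length 40)
padded_row = [145, 838, 2, 907, 1749, 2093, 582, 4719, 221, 143, 39, 46, 2, 10736] + [0] * 26

# Skip the zeros, since index 0 is only padding and does not stand for any word
decoded = " ".join(index_to_word[i] for i in padded_row if i != 0)
print(decoded)
```

The 40 columns in the printed shape correspond to the padded length of every row, and the trailing zeros come from `padding='post'`.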


***
***
# <font color='orange'>ASSIGNMENT 1: Explore the BBC News Archive</font>

**💡 <u>Data Info</u>**:
The data contains headings from the BBC News archive. Each heading belongs to one of 5 categories:
* Business
* Entertainment
* Politics
* Sport
* Tech

Each heading has a label defining which category it belongs to.

**🎯 <u>Goal</u>**: To classify an article/heading into its category by understanding its relevance and topic.

**🔑  <u>Approach:</u>**
> *

In [None]:
# RUN THESE CELLS AFTER RECONNECTING...
# FROM HERE ==================================================
# Download the data from kaggle:
# from google.colab import files
# files.upload()

In [None]:
# Creating kaggle config.
# !mkdir -p ~/.kaggle
# !mv kaggle.json ~/.kaggle/
# !chmod 600 ~/.kaggle/kaggle.json

In [None]:
# downloading the data as zip file
!kaggle competitions download -c learn-ai-bbc

In [13]:
# unzipping the archive into the data directory
!unzip learn-ai-bbc.zip -d data

# TO HERE ======================================================

Archive:  learn-ai-bbc.zip
  inflating: data/BBC News Sample Solution.csv  
  inflating: data/BBC News Test.csv  
  inflating: data/BBC News Train.csv  


#### Loading the data and getting started

In [19]:
# IMPORT MODULES
import csv
import os
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [20]:
# LOOKING AT THE STRUCTURE
training_data_path = os.path.join('./data', "BBC News Train.csv")
with open(training_data_path, 'r') as csvfile:
    print(f"first line(header) looks like this:\n\n{csvfile.readline()}")
    print(f"Each data point looks like this:\n\n{csvfile.readline()}")

first line(header) looks like this:

ArticleId,Text,Category

Each data point looks like this:




#### Remove stopwords
**Stopwords** are the most common words in a language, and they rarely provide useful information for the classification process.

In [21]:
# REMOVE STOPWORD FUNCTION
def remove_stopwords(sentence):
    """
    Lowercases a sentence and removes the stopwords from it

    Args:
        sentence (string): sentence to remove the stopwords from
    Returns:
        sentence (string): lowercase sentence without stopwords
    """
    # List of stopwords
    stopwords = ["a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "as", "i", "you", "to", "the"]


    # sentence converted to lowercase
    sentence = sentence.lower()

    # removing the stopwords
    words = []
    for word in sentence.split(" "):
        if word not in stopwords:
            words.append(word)
    sentence = " ".join(words)

    return sentence

In [22]:
# TESTING THE FUNCTION
remove_stopwords("I am about to go to the store and get any snack")

'go store get snack'

#### Reading the raw data
Now we have to read the data from the CSV file. <font color='orange'>To do so - </font>
> * Skip the first line, as it contains the headers and not data points
> * There is no need to store the data points as **NumPy arrays**; regular lists are fine
> * To read from the CSV file, use `csv.reader` with the appropriate arguments
> * `csv.reader` returns an iterable that yields one row per iteration, so with the `ArticleId,Text,Category` header the text can be accessed via `row[1]` and the label via `row[2]`
> * Use the `remove_stopwords` function on each sentence

In [33]:
# PARSE DATA FROM FILE:
def parse_data_from_file(filename: str)->tuple:
    """
    Extracts sentences and labels from a CSV file

    Args:
        filename (string): path to the CSV file
    Returns:
        sentences, labels (list of string, list of string): tuple containing lists of sentences and labels
    """
    sentences = []
    labels = []
    with open(filename, 'r', newline='') as csvfile:
        reader = csv.reader(csvfile, delimiter = ',')
        next(reader) # skip the header row
        for article_info in reader:
            sentences.append(remove_stopwords(article_info[1])) # article text is the second column (index 1)
            labels.append(article_info[2]) # category label is the third column (index 2)
    return sentences, labels
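
The column layout can be checked on a small in-memory sample before running the function on the real file. The sketch below uses the header shown earlier (`ArticleId,Text,Category`); the two data rows are invented for illustration and are not taken from the BBC dataset.

```python
import csv
import io

# Hypothetical two-row sample with the same header as the training file
sample_csv = io.StringIO(
    "ArticleId,Text,Category\n"
    '1,"stocks rally, markets close higher",business\n'
    "2,new phone released today,tech\n"
)

reader = csv.reader(sample_csv, delimiter=",")
header = next(reader)  # first row holds the column names, not data
rows = list(reader)

sample_texts = [row[1] for row in rows]   # second column: article text
sample_labels = [row[2] for row in rows]  # third column: category label

print(header)
print(sample_texts)
print(sample_labels)
```

Because the first text is quoted, the comma inside it does not split the field, which is one reason to use `csv.reader` instead of a plain `split(",")`.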

In [32]:
# Testing function
# With original
sentences, labels = parse_data_from_file(training_data_path)

print("ORIGINAL DATASET:\n")
print(f"There are {len(sentences)} sentences in the dataset.\n")
print(f"There are {len(labels)} labels in the dataset.\n")
print(f"The first 5 labels are {labels[:5]}\n\n")

# With a miniature version of the dataset that contains only the first 5 rows
# mini_sentences, mini_labels = parse_data_from_file()
# I don't have the miniature dataset yet.
# We'll see

ORIGINAL DATASET:

There are 1490 sentences in the dataset.

There are 1490 labels in the dataset.

The first 5 labels are ['business', 'business', 'business', 'tech', 'business']




#### Using Tokenizer

