<a href="https://colab.research.google.com/github/dswh/lil_nlp_with_tensorflow/blob/main/01_05_solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis - Tokenizing news headlines for data preparation!
This notebook covers the data preparation step: tokenizing the news headlines and converting them into padded sequences.

Data preparation includes the following steps:
1. Download and read the data
2. Segregate the headlines and their labels
3. Tokenize the headlines
4. Create sequences and add padding

## 1. Download and read the news headlines data

This is a [Kaggle dataset](https://www.kaggle.com/rmisra/news-headlines-dataset-for-sarcasm-detection) that was corrected and then hosted on Google Cloud Storage.

In [None]:
!wget --no-check-certificate \
    https://storage.googleapis.com/wdd-2-node.appspot.com/x1.json \
    -O /tmp/headlines.json

In [None]:
##read the data using the pandas library
import pandas as pd

data = pd.read_json("/tmp/headlines.json")
data.head()

| Unnamed: 0 | is_sarcastic | headline | article_link |
|---|---|---|---|
| 0 | 1 | thirtysomething scientists unveil doomsday clo... | https://www.theonion.com/thirtysomething-scien... |
| 1 | 0 | dem rep. totally nails why congress is falling... | https://www.huffingtonpost.com/entry/donna-edw... |
| 2 | 0 | eat your veggies: 9 deliciously different recipes | https://www.huffingtonpost.com/entry/eat-your-... |
| 3 | 1 | inclement weather prevents liar from getting t... | https://local.theonion.com/inclement-weather-p... |
| 4 | 1 | mother comes pretty close to using word 'strea... | https://www.theonion.com/mother-comes-pretty-c... |


## 2. Segregate the headlines and their labels

In [None]:
##create lists to store the headlines and labels
headlines = list(data['headline'])
labels = list(data['is_sarcastic'])

## Import the APIs

In [None]:
##import the required APIs
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences


## 3. Tokenize the data

In [None]:
##set up the tokenizer
tokenizer = Tokenizer(oov_token="<oov>")
tokenizer.fit_on_texts(headlines)

In [None]:
word_index = tokenizer.word_index
print(word_index)
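
To make the structure of `word_index` more concrete, here is a minimal pure-Python sketch of how a frequency-ranked vocabulary with an out-of-vocabulary token can be built. It is an illustration only, not the Keras implementation: the helper names `build_word_index` and `encode` and the toy headlines are made up for this example, and it splits on whitespace only, whereas the Keras `Tokenizer` also strips punctuation by default. It does follow the same conventions of lowercasing the text, giving the OOV token index 1, and leaving index 0 unused so that it can serve as the padding value.

```python
from collections import Counter


def build_word_index(texts, oov_token="<oov>"):
    """Hypothetical helper: map each word to an integer rank by frequency."""
    counts = Counter()
    for text in texts:
        # Simplification: lowercase and split on whitespace only.
        counts.update(text.lower().split())

    # Index 0 is left unused (reserved for padding); the OOV token gets 1.
    word_index = {oov_token: 1}
    # most_common() orders by count, ties in order of first appearance.
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index


def encode(text, word_index, oov_token="<oov>"):
    """Hypothetical helper: replace each word by its index, unknown words by OOV."""
    oov_id = word_index[oov_token]
    return [word_index.get(word, oov_id) for word in text.lower().split()]


toy_headlines = ["the cat sat", "the dog sat down"]
toy_index = build_word_index(toy_headlines)

# "bird" never appeared in the toy headlines, so it maps to the OOV index.
toy_encoded = encode("the bird sat", toy_index)

print(toy_index)
print(toy_encoded)
```

More frequent words receive smaller indices, so the most common words in the real `word_index` printed above appear first, right after the `<oov>` entry.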




## 4. Create padded sequences

In [None]:
##create sequences of the headlines
seqs = tokenizer.texts_to_sequences(headlines)

##post-pad sequences
padded_seqs = pad_sequences(seqs, padding="post")


In [None]:
##printing padded sequences sample
print(padded_seqs[0])

[16004   355  3167  7474  2644     3   661  1119     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0]
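
When no `maxlen` is given, `pad_sequences` pads every row to the length of the longest sequence, which is why the short first headline above is followed by so many zeros. The sketch below shows the idea of post-padding in plain Python. The function name `pad_post` and the toy sequences are assumptions made for this illustration, and the sketch truncates from the end when `maxlen` is shorter than a sequence, which corresponds to `truncating="post"` in Keras rather than its default of `"pre"`.

```python
def pad_post(sequences, maxlen=None, value=0):
    """Hypothetical helper: right-pad (and right-truncate) sequences to maxlen."""
    if maxlen is None:
        # Default behaviour: use the length of the longest sequence.
        maxlen = max(len(seq) for seq in sequences)

    padded = []
    for seq in sequences:
        trimmed = list(seq[:maxlen])  # drop tokens beyond maxlen, if any
        padded.append(trimmed + [value] * (maxlen - len(trimmed)))
    return padded


toy_seqs = [[2, 1, 3], [4, 5], [6, 7, 8, 9, 10]]

# Pad to the longest sequence (length 5).
toy_padded = pad_post(toy_seqs)

# Cap every row at 3 tokens: the long sequence is cut, the short one padded.
toy_capped = pad_post(toy_seqs, maxlen=3)

print(toy_padded)
print(toy_capped)
```

In the notebook itself, a similar cap could be applied by passing `maxlen` (and, if desired, `truncating="post"`) to `pad_sequences`; whether that is worthwhile depends on how long the headlines in the dataset actually are.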
