## Scenario: Analyzing and Segregating News Headlines for **Sarcasm Detection**

Past studies in Sarcasm Detection mostly use Twitter datasets collected with hashtag-based supervision, but such datasets are noisy in terms of both labels and language. Furthermore, many tweets are replies to other tweets, and detecting sarcasm in them requires access to the contextual tweets.

To overcome the limitations related to noise in Twitter datasets, this News Headlines dataset for Sarcasm Detection was collected from two news websites. TheOnion produces sarcastic versions of current events, and we collected all the headlines from its News in Brief and News in Photos categories (which are sarcastic). We collected real (non-sarcastic) news headlines from HuffPost.



### **Dataset Description:**

The data set contains the following attributes:

- **is_sarcastic**: 1 if the record is sarcastic, otherwise 0

- **headline**: the headline of the news article

- **article_link**: link to the original news article, useful for collecting supplementary data

### **Tasks to be performed:**

- Download the data set from Dropbox and install dependencies
- Import required libraries and load the dataset
- Perform Exploratory Data Analysis (EDA)
  - Analyze the data using **Pandas Profiling** and record your observations
  - Use **Sweetviz** to visualize the columns present in the data set
  - Analyze the target variable **is_sarcastic**
- Implement Text Pre-processing
- Implement TF-IDF Vectorizer
- Split the data set into training and testing sets using the **train_test_split** function from sklearn
- Model Building
  - Bernoulli Classifier
- Model Evaluation




In [2]:
# Import the required libraries
import spacy

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore", category = FutureWarning)

print("Libraries Imported..")

Libraries Imported..


In [3]:
# Reading the dataset
df =  pd.read_json(r"C:\Users\Shivani Dussa\Downloads\Sarcasm_Headlines_Dataset.json",lines = True)
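
The `lines = True` argument tells pandas that the file is in JSON Lines format, with one JSON object per line rather than a single top-level array. As a rough illustration, the same layout can be parsed with the standard library alone. The two records below are made-up placeholders (including the `example.com` links), not rows from the dataset.

```python
import json

# Two hypothetical records in JSON Lines format (one JSON object per line).
# The field names match the dataset; the values are invented for illustration.
raw = (
    '{"article_link": "https://example.com/a", "headline": "first sample headline", "is_sarcastic": 0}\n'
    '{"article_link": "https://example.com/b", "headline": "second sample headline", "is_sarcastic": 1}\n'
)

# Parse each non-empty line independently, which is what lines=True implies
records = [json.loads(line) for line in raw.splitlines() if line.strip()]

print(len(records))                 # number of parsed records
print(sorted(records[0].keys()))    # field names of the first record
```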

In [4]:
print(df.shape)
df.head()           

(26709, 3)


| | article_link | headline | is_sarcastic |
|---|---|---|---|
| 0 | https://www.huffingtonpost.com/entry/versace-b... | former versace store clerk sues over secret 'b... | 0 |
| 1 | https://www.huffingtonpost.com/entry/roseanne-... | the 'roseanne' revival catches up to our thorn... | 0 |
| 2 | https://local.theonion.com/mom-starting-to-fea... | mom starting to fear son's web series closest ... | 1 |
| 3 | https://politics.theonion.com/boehner-just-wan... | boehner just wants wife to listen, not come up... | 1 |
| 4 | https://www.huffingtonpost.com/entry/jk-rowlin... | j.k. rowling wishes snape happy birthday in th... | 0 |


In [5]:
# The dataset labels each headline as sarcastic or not:
# sarcastic headlines are 1 and non-sarcastic headlines are 0

### Exploratory Data Analysis

### Analysing the data using pandas profiling

In [None]:
!pip install pandas-profiling

In [5]:
df.columns

Index(['article_link', 'headline', 'is_sarcastic'], dtype='object')

In [6]:
df.is_sarcastic.value_counts()         # 14985 non-sarcastic and 11724 sarcastic headlines: a roughly balanced dataset

0    14985
1    11724
Name: is_sarcastic, dtype: int64

In [7]:
print('Percentages for is_sarcastic values')
df.is_sarcastic.value_counts() * 100/df.shape[0]    # non-sarcastic ≈ 56%, sarcastic ≈ 44%

Percentages for is_sarcastic values


0    56.104684
1    43.895316
Name: is_sarcastic, dtype: float64
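
The percentages can also be checked directly from the class counts. The sketch below recomputes them in plain Python from the two counts reported above; in pandas, `value_counts(normalize=True)` returns the same proportions as fractions.

```python
# Class counts taken from the value_counts() output above
counts = {0: 14985, 1: 11724}
total = sum(counts.values())

# Share of each class as a percentage of all headlines
percentages = {label: 100 * n / total for label, n in counts.items()}

for label, pct in percentages.items():
    print(label, round(pct, 2))
```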

### Checking for null values

In [8]:
df.isnull().sum()

article_link    0
headline        0
is_sarcastic    0
dtype: int64

In [9]:
df.head()

| | article_link | headline | is_sarcastic |
|---|---|---|---|
| 0 | https://www.huffingtonpost.com/entry/versace-b... | former versace store clerk sues over secret 'b... | 0 |
| 1 | https://www.huffingtonpost.com/entry/roseanne-... | the 'roseanne' revival catches up to our thorn... | 0 |
| 2 | https://local.theonion.com/mom-starting-to-fea... | mom starting to fear son's web series closest ... | 1 |
| 3 | https://politics.theonion.com/boehner-just-wan... | boehner just wants wife to listen, not come up... | 1 |
| 4 | https://www.huffingtonpost.com/entry/jk-rowlin... | j.k. rowling wishes snape happy birthday in th... | 0 |


In [10]:
# Add a column with the number of whitespace-separated words in each headline
df['num_words'] = df['headline'].apply(lambda x: len(str(x).split()))
df.head()

| | article_link | headline | is_sarcastic | num_words |
|---|---|---|---|---|
| 0 | https://www.huffingtonpost.com/entry/versace-b... | former versace store clerk sues over secret 'b... | 0 | 12 |
| 1 | https://www.huffingtonpost.com/entry/roseanne-... | the 'roseanne' revival catches up to our thorn... | 0 | 14 |
| 2 | https://local.theonion.com/mom-starting-to-fea... | mom starting to fear son's web series closest ... | 1 | 14 |
| 3 | https://politics.theonion.com/boehner-just-wan... | boehner just wants wife to listen, not come up... | 1 | 13 |
| 4 | https://www.huffingtonpost.com/entry/jk-rowlin... | j.k. rowling wishes snape happy birthday in th... | 0 | 11 |


In [11]:
df.loc[0,'headline']

"former versace store clerk sues over secret 'black code' for minority shoppers"

In [12]:
maxWords = df['num_words'].max()
print("maximum no of words:",maxWords)

maximum no of words: 39


In [13]:
maxtext =  df[df['num_words'] == 39]
maxtext

| | article_link | headline | is_sarcastic | num_words |
|---|---|---|---|---|
| 15247 | https://www.theonion.com/elmore-leonard-modern... | elmore leonard, modern prose master, noted for... | 1 | 39 |


In [14]:
df.loc[15247,'headline']

'elmore leonard, modern prose master, noted for his terse prose style and for writing about things perfectly and succinctly with a remarkable economy of words, unfortunately and sadly expired this gloomy tuesday at the age of 87 years old'

In [15]:
text = df[df['num_words'] == maxWords]['headline'].values

In [16]:
print(type(text))
text

<class 'numpy.ndarray'>


array(['elmore leonard, modern prose master, noted for his terse prose style and for writing about things perfectly and succinctly with a remarkable economy of words, unfortunately and sadly expired this gloomy tuesday at the age of 87 years old'],
      dtype=object)

In [17]:
print(text[0])

elmore leonard, modern prose master, noted for his terse prose style and for writing about things perfectly and succinctly with a remarkable economy of words, unfortunately and sadly expired this gloomy tuesday at the age of 87 years old


## Text Pre-processing

#### Word tokenization
- Word tokenization is the process of splitting a sentence or text into individual words (tokens).

In [19]:
# Word tokenization
nlp = spacy.load('en_core_web_sm')
tokenCollection = nlp(text[0])

# List comprehension to collect the text of each token
tokenList = [token.text for token in tokenCollection]
print(tokenList)

['elmore', 'leonard', ',', 'modern', 'prose', 'master', ',', 'noted', 'for', 'his', 'terse', 'prose', 'style', 'and', 'for', 'writing', 'about', 'things', 'perfectly', 'and', 'succinctly', 'with', 'a', 'remarkable', 'economy', 'of', 'words', ',', 'unfortunately', 'and', 'sadly', 'expired', 'this', 'gloomy', 'tuesday', 'at', 'the', 'age', 'of', '87', 'years', 'old']
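
The token list above has 42 entries although `num_words` for this headline is 39, because whitespace splitting leaves the three commas attached to the preceding words while the tokenizer separates them. The sketch below reproduces that difference with a simple regular expression. It is only a rough stand-in for spaCy's tokenizer that happens to agree on plain text like this, not a reproduction of spaCy's rules.

```python
import re

# Longest headline in the dataset, copied from the output above
text = ("elmore leonard, modern prose master, noted for his terse prose style "
        "and for writing about things perfectly and succinctly with a remarkable "
        "economy of words, unfortunately and sadly expired this gloomy tuesday "
        "at the age of 87 years old")

# Whitespace splitting keeps punctuation attached to the preceding word
words = text.split()

# Runs of word characters, or single non-space punctuation characters
tokens = re.findall(r"\w+|[^\w\s]", text)

print(len(words), len(tokens))
print(tokens[:7])
```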


#### Punctuation
The spaCy library defines several character lists, such as **quotes, currency symbols, and punctuation**.
In the sentence above, each comma became a separate token, which is not useful for our analysis, so we will remove punctuation from the sentences.

In [20]:
# Data pre-processing
# Inspecting the punctuation lists that spaCy defines
print('Quotes:',spacy.lang.punctuation.LIST_QUOTES)     # These are quotes

Quotes: ["\\'", '"', '”', '“', '`', '‘', '´', '’', '‚', ',', '„', '»', '«', '「', '」', '『', '』', '（', '）', '〔', '〕', '【', '】', '《', '》', '〈', '〉']


In [21]:
print('punctuations:',spacy.lang.punctuation.LIST_PUNCT)    # These are Punctuations

punctuations: ['…', '……', ',', ':', ';', '\\!', '\\?', '¿', '؟', '¡', '\\(', '\\)', '\\[', '\\]', '\\{', '\\}', '<', '>', '_', '#', '\\*', '&', '。', '？', '！', '，', '、', '；', '：', '～', '·', '।', '،', '۔', '؛', '٪']


In [22]:
print('\n Currency:',spacy.lang.punctuation.LIST_CURRENCY)


 Currency: ['\\$', '£', '€', '¥', '฿', 'US\\$', 'C\\$', 'A\\$', '₽', '﷼', '₴', '₠', '₡', '₢', '₣', '₤', '₥', '₦', '₧', '₨', '₩', '₪', '₫', '€', '₭', '₮', '₯', '₰', '₱', '₲', '₳', '₴', '₵', '₶', '₷', '₸', '₹', '₺', '₻', '₼', '₽', '₾', '₿']


In [25]:
# LIST_PUNCT covers most punctuation marks; here we collect the tokens that spaCy flags as punctuation
punct = [token.text for token in tokenCollection if token.is_punct]
print('Punctuation:',punct)

Punctuation: [',', ',', ',']


### Stopwords

In [37]:
# Stop words will be removed from the dataset
stopwords = list(spacy.lang.en.stop_words.STOP_WORDS)
print('Number of stopwords is:','-'*20,len(stopwords))       # '-'*20 repeats the dash 20 times to print a separator
print('Ten stop words:',list(stopwords[:10]))

stop = [token.text for token in tokenCollection if token.is_stop]
print('\n','*'*120,'\nStop Word in sentence:',stop)

Number of stopwords is: -------------------- 326
Ten stop words: ['onto', 'might', "'ll", 'may', 'did', 'her', '’ve', 'its', 'his', 'meanwhile']

 ************************************************************************************************************************ 
Stop Word in sentence: ['for', 'his', 'and', 'for', 'about', 'and', 'with', 'a', 'of', 'and', 'this', 'at', 'the', 'of']


### Digit 

In [39]:
digit = [token.text for token in tokenCollection if token.is_digit]
print('digits in sentence:',digit)

digits in sentence: ['87']
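
Taken together, the stop-word, punctuation, and digit flags act as a filter over the token list. The following is a minimal pure-Python sketch of that idea. `STOP_WORDS_SAMPLE` is a hand-picked subset of the stop words found in the sentence above and `keep_token` is a hypothetical helper, so the result only approximates what spaCy's `is_stop`, `is_punct`, and `is_digit` attributes would give.

```python
import string

# Small illustrative subset of stop words (spaCy's full list has 326 entries)
STOP_WORDS_SAMPLE = {"for", "his", "and", "about", "with", "a", "of", "this", "at", "the"}

def keep_token(token):
    """Return True if the token is not a stop word, punctuation mark, or digit string."""
    if token in STOP_WORDS_SAMPLE:
        return False
    if all(ch in string.punctuation for ch in token):
        return False
    if token.isdigit():
        return False
    return True

# Tail of the tokenized headline shown earlier
tokens = ["words", ",", "unfortunately", "and", "sadly", "expired", "this",
          "gloomy", "tuesday", "at", "the", "age", "of", "87", "years", "old"]

kept = [t for t in tokens if keep_token(t)]
print(kept)
```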


## Lemmatizing
- Lemmatization is the process of reducing a word to its base or dictionary form (its lemma), for example "noted" to "note" and "words" to "word". It is an essential step in NLP because it maps the different variants of a word to a single form.

In [40]:
lemma = [token.lemma_ for token in tokenCollection]
print(lemma)

['elmore', 'leonard', ',', 'modern', 'prose', 'master', ',', 'note', 'for', 'his', 'terse', 'prose', 'style', 'and', 'for', 'write', 'about', 'thing', 'perfectly', 'and', 'succinctly', 'with', 'a', 'remarkable', 'economy', 'of', 'word', ',', 'unfortunately', 'and', 'sadly', 'expire', 'this', 'gloomy', 'tuesday', 'at', 'the', 'age', 'of', '87', 'year', 'old']


## Named Entities
- A named entity is a "real-world object" that's assigned a name – for example, a person, a country, a product or a book title.

In [47]:
spacy.displacy.render(tokenCollection, style = 'ent',jupyter = True)

In [64]:
newdf = pd.DataFrame(
{
    'token': [w.text for w in tokenCollection ],
    'lemma': [w.lemma_ for w in tokenCollection],
    'POS'  : [w.pos_ for w in tokenCollection],
    'TAG'  : [w.tag_ for w in tokenCollection],
    'DEP' : [w.dep_ for w in tokenCollection],
    'is_stopword' : [w.is_stop for w in tokenCollection],
    'is_punctuation' : [w.is_punct for w in tokenCollection],
    'is_digit' : [w.is_digit for w in tokenCollection]
})

def highlight_True(s):
    """
    Return an orange background style for every True value in the column
    """
    return ['background-color:orange' if v else '' for v in s]
newdf.style.apply(highlight_True,subset = ['is_stopword','is_punctuation','is_digit'])

| | token | lemma | POS | TAG | DEP | is_stopword | is_punctuation | is_digit |
|---|---|---|---|---|---|---|---|---|
| 0 | elmore | elmore | PROPN | NNP | advmod | False | False | False |
| 1 | leonard | leonard | PROPN | NNP | nsubj | False | False | False |
| 2 | , | , | PUNCT | , | punct | False | True | False |
| 3 | modern | modern | ADJ | JJ | amod | False | False | False |
| 4 | prose | prose | NOUN | NN | compound | False | False | False |
| 5 | master | master | NOUN | NN | appos | False | False | False |
| 6 | , | , | PUNCT | , | punct | False | True | False |
| 7 | noted | note | VERB | VBD | ROOT | False | False | False |
| 8 | for | for | SCONJ | IN | mark | True | False | False |
| 9 | his | his | PRON | PRP$ | poss | True | False | False |


## Cleaning the data

In [68]:
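def clean_text(df):
    """Lemmatize each headline and drop stop words, punctuation, and digits."""
    nlp = spacy.load('en_core_web_sm')
    for i in range(df.shape[0]):
        tokenCollection = nlp(df.loc[i, 'headline'])
        tokenList = [token.lemma_.lower().strip() for token in tokenCollection
                if not (token.is_stop or token.is_punct or token.is_digit)]
        text = " ".join(tokenList)

        if i < 5: print('Sentence:',i,text)
        # .loc writes into df itself and avoids the chained-assignment warning
        df.loc[i, 'headline'] = text
    # Return only after the loop finishes; a return inside the loop would exit after row 0
    return df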

In [66]:
df.shape

(26709, 4)

In [69]:
news_df = clean_text(df)
print(news_df)

Sentence: 0 versace store clerk sue secret black code minority shopper
                                            article_link  \
0      https://www.huffingtonpost.com/entry/versace-b...   
1      https://www.huffingtonpost.com/entry/roseanne-...   
2      https://local.theonion.com/mom-starting-to-fea...   
3      https://politics.theonion.com/boehner-just-wan...   
4      https://www.huffingtonpost.com/entry/jk-rowlin...   
...                                                  ...   
26704  https://www.huffingtonpost.com/entry/american-...   
26705  https://www.huffingtonpost.com/entry/americas-...   
26706  https://www.huffingtonpost.com/entry/reparatio...   
26707  https://www.huffingtonpost.com/entry/israeli-b...   
26708  https://www.huffingtonpost.com/entry/gourmet-g...   

                                                headline  is_sarcastic  \
0      versace store clerk sue secret black code mino...             0   
1      the 'roseanne' revival catches up to our thorn...    

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['headline'][i] = text


### Implement TF-IDF Vectorizer

In [72]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer(analyzer = 'word',ngram_range = (1,3),max_features = 3000)   # vocabulary limited to the 3,000 most frequent n-grams (1 to 3 words)
X = tf.fit_transform(news_df['headline'])
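
The `ngram_range = (1,3)` setting means the vocabulary is built from single words, word pairs, and word triples. Below is a small sketch of how such n-grams are enumerated for one cleaned headline. `word_ngrams` is a hypothetical helper written for this illustration; the real vectorizer additionally applies its own token pattern and keeps only the `max_features` most frequent terms.

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """All contiguous word n-grams with min_n <= n <= max_n, joined by spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

# First cleaned headline, taken from the output above
tokens = "versace store clerk sue secret black code minority shopper".split()
grams = word_ngrams(tokens)

print(len(grams))            # 9 unigrams + 8 bigrams + 7 trigrams
print(grams[:3], grams[-1])
```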

In [74]:
X.shape,type(X)

((26709, 3000), scipy.sparse.csr.csr_matrix)
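
For intuition about the numbers stored in `X`, the sketch below computes TF-IDF weights by hand for a three-document toy corpus. It follows the smoothed formula that scikit-learn documents as the `TfidfVectorizer` default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. It handles unigrams only, and the corpus is invented for illustration.

```python
import math
from collections import Counter

# Toy corpus of already-cleaned headlines (invented for illustration)
docs = ["store clerk sue", "clerk win lottery", "store close"]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(tokenized)

# Document frequency: number of documents containing each term
df_counts = Counter(t for doc in tokenized for t in set(doc))

# Smoothed inverse document frequency
idf = {t: math.log((1 + n_docs) / (1 + df_counts[t])) + 1 for t in vocab}

def tfidf_row(doc):
    """TF-IDF vector for one tokenized document, L2-normalized."""
    tf = Counter(doc)
    raw = [tf[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw] if norm else raw

matrix = [tfidf_row(doc) for doc in tokenized]
for term, weight in zip(vocab, matrix[0]):
    print(term, round(weight, 3))
```

In the first row, "sue" receives a higher weight than "store" or "clerk" because it occurs in only one document.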

In [75]:
df['headline'][0]

'versace store clerk sue secret black code minority shopper'

In [81]:
df['headline'][26700]

"what's in your mailbox? tips on what to do when uncle sam comes knocking"

### **Splitting the data into training and testing set using train_test_split function from sklearn**

In [84]:
y = news_df['is_sarcastic']

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3,random_state = 42)
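
With `test_size = 0.3` and 26709 rows, the split sizes can be worked out by hand. The sketch assumes scikit-learn's behaviour of rounding the test fraction up and giving the remaining rows to the training set, which agrees with the support of 8013 reported in the classification report of this notebook.

```python
import math

n_samples = 26709      # rows in the dataset
test_size = 0.3        # fraction passed to train_test_split

# The test set size is rounded up; the training set gets the remaining rows
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

Because the classes are not perfectly balanced, passing `stratify=y` to `train_test_split` is an option for keeping the class ratio the same in both subsets.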

### Model Building 

In [86]:
# Creating a model object

from sklearn.naive_bayes import BernoulliNB
nb = BernoulliNB()

#Fitting the model on the training dataset
nb.fit(X_train,y_train)

BernoulliNB()
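
`BernoulliNB` models each feature as present or absent. With its default `binarize=0.0`, every positive TF-IDF weight is treated as 1, so the classifier effectively sees only which n-grams occur in a headline. The from-scratch sketch below illustrates the idea on a made-up two-feature dataset. `fit_bernoulli_nb` and `predict_bernoulli_nb` are hypothetical helpers written for this illustration, not scikit-learn functions.

```python
import math

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed P(feature present | class)."""
    model = {}
    n_features = len(X[0])
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        present = [sum(1 for r in rows if r[j] > 0) for j in range(n_features)]
        # Laplace smoothing with two outcomes per feature (present / absent)
        probs = [(k + alpha) / (len(rows) + 2 * alpha) for k in present]
        model[c] = (math.log(len(rows) / len(X)), probs)
    return model

def predict_bernoulli_nb(model, x):
    """Return the class with the highest log posterior for one sample."""
    scores = {}
    for c, (log_prior, probs) in model.items():
        score = log_prior
        for value, p in zip(x, probs):
            # Any positive weight counts as "present"; absence also contributes
            score += math.log(p) if value > 0 else math.log(1 - p)
        scores[c] = score
    return max(scores, key=scores.get)

# Invented TF-IDF-like weights for two features; label 1 = sarcastic
X = [[0.7, 0.0], [0.5, 0.4], [0.0, 0.9], [0.0, 0.0]]
y = [1, 1, 0, 0]

model = fit_bernoulli_nb(X, y)
first = predict_bernoulli_nb(model, [0.3, 0.0])
second = predict_bernoulli_nb(model, [0.0, 0.6])
print(first, second)
```

Only the sign of each weight matters here: replacing 0.3 with any other positive value gives the same prediction.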

### **Model Evaluation**

In [87]:
pred = nb.predict(X_test)

from sklearn.metrics import classification_report,confusion_matrix

print('Confusion matrix\n',confusion_matrix(y_test,pred))

Confusion matrix
 [[3704  740]
 [ 628 2941]]


In [88]:
len(df)*3

80127

In [89]:
3704+2941    # correctly classified headlines (diagonal of the confusion matrix)

6645

In [92]:
from sklearn.metrics import  accuracy_score, f1_score, precision_score, recall_score, roc_auc_score

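# pred holds hard 0/1 labels, so this AUC reflects a single threshold; nb.predict_proba(X_test)[:, 1] would give a score-based AUC
roc = roc_auc_score(y_test,pred)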
acc = accuracy_score(y_test,pred)
prec = precision_score(y_test,pred)
recall = recall_score(y_test,pred)
f1 = f1_score(y_test,pred)

In [94]:
results = pd.DataFrame([['Bernoulli Classifier', acc,prec,recall, f1,roc],
                        ],
               columns = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score','ROC'])
results

| | Model | Accuracy | Precision | Recall | F1 Score | ROC |
|---|---|---|---|---|---|---|
| 0 | Bernoulli Classifier | 0.829277 | 0.798968 | 0.82404 | 0.81131 | 0.828762 |
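
Each figure in the results table can be reproduced from the four cells of the confusion matrix. The sketch below does the arithmetic in plain Python. It also shows that an ROC AUC computed from hard class predictions, as done here, reduces to the mean of the true-positive and true-negative rates.

```python
# Cells of the confusion matrix reported above (rows = actual, columns = predicted)
tn, fp = 3704, 740
fn, tp = 628, 2941

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)                    # true-positive rate
specificity = tn / (tn + fp)               # true-negative rate
f1 = 2 * precision * recall / (precision + recall)
roc_from_labels = (recall + specificity) / 2

print(round(accuracy, 6), round(precision, 6), round(recall, 6))
print(round(f1, 5), round(roc_from_labels, 6))
```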


In [97]:
print('Classification report:\n', classification_report(y_test,pred))

Classification report:
               precision    recall  f1-score   support

           0       0.86      0.83      0.84      4444
           1       0.80      0.82      0.81      3569

    accuracy                           0.83      8013
   macro avg       0.83      0.83      0.83      8013
weighted avg       0.83      0.83      0.83      8013

