## Text data

### [Kaggle fake news dataset](https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset?resource=download)

### [Spacy tutorial 1](https://www.kaggle.com/code/sudalairajkumar/getting-started-with-spacy)

### [Spacy tutorial 2](https://towardsdatascience.com/analysis-and-visualization-of-unstructured-text-data-2de07d9adc84)



In [22]:
!pip3 install spacytextblob

import pandas as pd
from datetime import datetime # for grabbing date ranges
import spacy # for natural language processing
from spacytextblob.spacytextblob import SpacyTextBlob # for sentiment analysis


nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('spacytextblob')



Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


<spacytextblob.spacytextblob.SpacyTextBlob at 0x7f61da992e90>

### NLP crash course with spacy

+ A text dataset is often called a corpus. Especially if it has been curated in some fashion (labeled, annotated, ...).

+ Spacy is a python package that performs standard NLP analyses of text. 

+ Tokenization: Splitting text into units
  - Sentence 
  - Word based tokenization (wasn't)
  - Subword based tokenization (morpheme-like units)
    + Byte pair encoding
    + WordPiece
    + Unigram
    + SentencePiece
  - Character based tokenization
+ Lemmatization: mapping word variants to canonical root form
  - removing pluralization
  - removing tense
+ Part of speech tagging: Labeling words with part of speech (verb, noun, etc.)
+ Noun phrase identification
+ Named entity recognition (NER)
  - person, place, country, currency, ...
+ Sentiment analysis

In [38]:
text = "Dr. James Harvey, the big furry cat ate the little brown mice who weren't very happy."
doc = nlp(text)
print([t.text for t in doc]) # word based tokenization
print([t.lemma_ for t in doc]) # lemmatization
print([t.pos_ for t in doc]) # Part of speech tagging
print([t.tag_ for t in doc]) # Fine grained part of speech tagging
print([n for n in doc.noun_chunks]) # Noun phrase parsing
spacy.explain('JJ')
print(nlp.get_pipe('ner').labels)
print([f'{ent.text}, {ent.label_}, {spacy.explain(ent.label_)}' for ent in doc.ents])

('CARDINAL', 'DATE', 'EVENT', 'FAC', 'GPE', 'LANGUAGE', 'LAW', 'LOC', 'MONEY', 'NORP', 'ORDINAL', 'ORG', 'PERCENT', 'PERSON', 'PRODUCT', 'QUANTITY', 'TIME', 'WORK_OF_ART')
['James Harvey, PERSON, People, including fictional']


In [39]:
fake = pd.read_csv('drive/MyDrive/DS311/archive/Fake.csv')
real = pd.read_csv('drive/MyDrive/DS311/archive/True.csv')

In [None]:
def clean_dates(df):
  """
  Translates dates from various string formats to python Datetime objects. 
  Adds columns Date: Datetime, month: int, year: int, day: int.
  Also tosses out corrupted rows in the dataframe.

  :param df: (DataFrame) With column 'date'

  :returns: (DataFrame) Cleaned with additional columns Date, month, day, year
  """
  formats = {1: '%d-%b-%y', 0: '%B %d, %Y', 2: '%b %d, %Y'}
  dates = []
  bad = []
  for i, date in enumerate(df['date']):
    try:
      if date[0].isdigit():
        f = formats[1]
      elif len(date.split()[0].strip()) > 3:
        f = formats[0]
      else:
        f = formats[2]
      dates.append(datetime.strptime(date.strip(), f))
    except:
      bad.append([date, format, i])
  print(bad)
  df = df.drop([b[-1] for b in bad])
  df['Date'] = dates
  df['month'] = pd.DatetimeIndex(dates).month
  df['year'] = pd.DatetimeIndex(dates).year
  df['day'] = pd.DatetimeIndex(dates).day
  return df

In [40]:
fake = clean_dates(fake)
real = clean_dates(real)

[['https://100percentfedup.com/served-roy-moore-vietnamletter-veteran-sets-record-straight-honorable-decent-respectable-patriotic-commander-soldier/', <built-in function format>, 9358], ['https://100percentfedup.com/video-hillary-asked-about-trump-i-just-want-to-eat-some-pie/', <built-in function format>, 15507], ['https://100percentfedup.com/12-yr-old-black-conservative-whose-video-to-obama-went-viral-do-you-really-love-america-receives-death-threats-from-left/', <built-in function format>, 15508], ['https://fedup.wpengine.com/wp-content/uploads/2015/04/hillarystreetart.jpg', <built-in function format>, 15839], ['https://fedup.wpengine.com/wp-content/uploads/2015/04/entitled.jpg', <built-in function format>, 15840], ['https://fedup.wpengine.com/wp-content/uploads/2015/04/hillarystreetart.jpg', <built-in function format>, 17432], ['https://fedup.wpengine.com/wp-content/uploads/2015/04/entitled.jpg', <built-in function format>, 17433], ['MSNBC HOST Rudely Assumes Steel Worker Would Neve

In [41]:
short_fake = fake[(fake.year == 2017) & (fake.month == 1)]


In [43]:
short_fake

Unnamed: 0,title,text,subject,date,Date,month,year,day
2749,Trump’s SCOTUS Pick Sided With Hobby Lobby Ag...,"On Tuesday, Donald Trump announced the identit...",News,"January 31, 2017",2017-01-31,1,2017,31
2750,It Took A Scathing Letter From Canada’s Prime...,Fox News couldn t wait to try to spin the Queb...,News,"January 31, 2017",2017-01-31,1,2017,31
2751,WATCH: Jake Tapper STUNNED Into Disbelief Lis...,Sean Spicer is doing his level best to make en...,News,"January 31, 2017",2017-01-31,1,2017,31
2752,An Anonymous Group Just Revealed The Direct P...,"Just after Donald Trump was sworn in, his admi...",News,"January 31, 2017",2017-01-31,1,2017,31
2753,Trump Jr. Just ‘Liked’ Tweet Praising Mosque ...,When it comes to how shameless the Trump famil...,News,"January 31, 2017",2017-01-31,1,2017,31
...,...,...,...,...,...,...,...,...
23076,SOUR GRAPES? Whatever happened to the ‘smooth ...,Andrew Malcolm McClatchy News You better stop...,Middle-east,"January 3, 2017",2017-01-03,1,2017,3
23077,HACKING DEMOCRACY? CIA Accusing Russia of Doin...,Peter Certo Other WordsEven in an election yea...,Middle-east,"January 3, 2017",2017-01-03,1,2017,3
23078,Good News for Silver in 2017,James Burgess Oil PricePrecious metals are an...,Middle-east,"January 3, 2017",2017-01-03,1,2017,3
23079,Gerald Celente: Top 10 Trends for 2017,"What can we expect in 2017? Inflated markets, ...",Middle-east,"January 2, 2017",2017-01-02,1,2017,2


In [42]:
short_real = real[(real.year == 2017) & (real.month == 1)]

In [50]:
# %time real_docs = short_real.title.apply(nlp)
# type(real_docs.iloc[0])
t = 7.14 # place real time here
(t/len(real_docs))*len(real)

203.88983999999996

In [51]:
%time fake_docs = short_fake.title.apply(nlp)

CPU times: user 11.7 s, sys: 80.9 ms, total: 11.8 s
Wall time: 12.2 s


In [53]:
# doc = fake_docs.iloc[0]
# print([t.text for t in doc]) # tokenized text of the document
# print([t.lemma_ for t in doc]) # lemmatized text of the document
# print([t.pos_ for t in doc]) # Part of speech for all tokens in the document
# print([t.tag_ for t in doc]) # Fine-grained part of speech for all tokens in the document
# print(spacy.explain('VBN')) # making sense out of the spacy acronyms
# print([c for c in doc.noun_chunks]) # All the noun phrases in the document
print(len([doc for doc in fake_docs if round(doc._.blob.polarity, 2)  < -.9])) # Sentiment analysis of the document. <0 means negative >0 means positive. 
print(len([doc for doc in real_docs if round(doc._.blob.polarity, 2) < -.9]))

print(len([doc for doc in fake_docs if round(doc._.blob.polarity, 2)  > .9])) # Sentiment analysis of the document. <0 means negative >0 means positive. 
print(len([doc for doc in real_docs if round(doc._.blob.polarity, 2) > .9]))


22
1
18
1


In [None]:
text = 'The big furry cat ate the little brown mouse.'
doc = nlp(text)
print([n for n in doc.noun_chunks])

[The big furry cat, the little brown mouse]


In [54]:
from collections import Counter

c = Counter(['a', 'a', 'b', 'b', 'b'])
print(c)
def count_nouns(doc):
  nouns = [
      token.lemma_ for token in doc if 
            (not token.is_stop and
             not token.is_punct and
             token.tag_ == "NNPS")]
  word_freq = Counter(nouns)
  return word_freq

counts = real_docs.apply(count_nouns).sum().most_common(20)
print(counts)
# # print(fake_docs.apply(count_nouns).sum().most_common(20))

Counter({'b': 3, 'a': 2})
