# NLP (Natural Language Processing)

Natural languages (English, Japanese, etc.) differ from computer languages: they evolve and change over time.
In NLP we are concerned with how to program computers to process and analyze natural language data (text).
NLP is a field of computer science and artificial intelligence with roots in linguistics.

# Why?
- We have large volumes of unstructured text data
    + How can we apply statistical analysis or machine learning to extract useful insight from it?
- Most data analysis techniques work on numeric data
    + Text needs specialized techniques such as NLP

# Applications
- Machine translation (Google Translate)
- Speech Recognition Systems (Smart assistants)
- Question Answering Systems (Autodiagnostic on company websites)
- Text summarization, categorization/classification/clustering
    - Sentiment analysis
    - Chatbots
    - Spam detection



Building any of the above applications is an involved process because text is free-flowing, unstructured data.
It requires
- cleaning (misspelled text, duplicates, stop words),
- tokenization: splitting text into a list of words,
- tagging (POS), stemming, lemmatization, and
- conversion to word vectors before applying any machine learning or statistical technique.

- POS tagging: each word gets a POS tag indicating its part of speech.
  + Here is a list of tags: [penn pos](https://www.clips.uantwerpen.be/pages/mbsp-tags)
  + Here is a demo: [Parts-of-speech.Info](https://parts-of-speech.info/)
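
The steps listed above can be sketched end to end with the standard library alone. This is a minimal illustration, not how nltk does it: the stop-word set is a tiny hypothetical one, the function names are made up for this example, and a plain bag-of-words count is assumed as the simplest possible "word vector".

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word set (nltk ships a much longer list).
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "it", "at"}

def preprocess(text):
    """Lowercase the text, tokenize on word characters, and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the token list."""
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]

docs = ["The plan is simple.", "A simple plan and a simple price."]
tokenized = [preprocess(d) for d in docs]

# The vocabulary is the sorted set of all remaining tokens.
vocabulary = sorted({t for doc in tokenized for t in doc})
vectors = [bag_of_words(doc, vocabulary) for doc in tokenized]

print(vocabulary)  # ['plan', 'price', 'simple']
print(vectors)     # [[1, 0, 1], [1, 1, 2]]
```

Each document ends up as a numeric vector of the same length, which is the form most statistical and machine learning methods expect.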



# Stemming and lemmatization
In NLP, stemming and lemmatization are text normalization techniques that prepare text (words, sentences, etc.) for further processing. In web search and information retrieval they are commonly used to increase recall. They are usually applied after tokenization.

- Stem: the part of a word to which affixes can be added.
A stem is the part of a word to which affixes can be attached: [inflectional](https://en.wikipedia.org/wiki/Inflection) ones such as **ed, ing, s**, and sometimes derivational ones such as **ize, de**. Stemming is the process of reducing inflected (or sometimes derived) words to their word stem or root, which may not itself be a word. For example, the [Porter stemmer](https://tartarus.org/martin/PorterStemmer/) reduces both apple and apples to appl.

- Lemma: the base (dictionary) form for a set of words, e.g. goose for geese.
    + Note: the Porter stemmer gives gees and goos as the stems of these words. Lemmatization (a more careful approach to removing inflections) produces the base form, typically using a lexical knowledge base like [WordNet](https://wordnet.princeton.edu/).

Stemming is not perfect. The Porter stemmer reduces both meanness and meaning to mean, creating a false equivalence.
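
To make the contrast concrete, here is a toy sketch of the two approaches. It is not the Porter algorithm and not WordNet: the suffix list and the lookup table are hypothetical and tiny, chosen only so that the examples from the text behave the same way.

```python
# Hypothetical suffix list, far smaller than the rule set of a real stemmer.
SUFFIXES = ("ness", "ing", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table standing in for a lexical database such as WordNet.
LEMMAS = {"geese": "goose", "apples": "apple"}

def toy_lemmatize(word):
    """Return the dictionary form if the word is known, else the word itself."""
    return LEMMAS.get(word, word)

for w in ["apples", "geese", "meanness", "meaning"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The rule-based stemmer is cheap but collapses meanness and meaning into the same string, while the lookup-based lemmatizer only changes words it actually knows.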


Let's look at some available text data (text corpora).
# Text corpora

A corpus is a large collection of written or spoken textual data. It usually comes with some associated metadata.


# Some popular corpora
- Brown Corpus: This was the first million-word corpus for the English language, compiled by Francis and Kucera at Brown University from texts printed in 1961 and first released in 1964.
- WordNet: This corpus is a semantic-oriented lexical database for the English language. It was created at Princeton University.
- Penn Treebank: This corpus consists of tagged and parsed English sentences including annotations like POS tags and grammar-based parse trees.
- Google N-gram Corpus: The Google N-gram Corpus consists of over a trillion words from various sources including books, web pages etc.
- Web, chat, email, tweets: We can gather this kind of textual data from social media.



# Some popular frameworks for text analysis
- nltk: the Natural Language Toolkit
- gensim: The gensim library has a rich set of capabilities for semantic analysis, including topic modeling and similarity analysis.
- textblob: text processing, phrase extraction, classification, POS tagging, and sentiment analysis
- spacy: claims to provide industrial-strength NLP capabilities by providing the best implementation of each technique and algorithm

# Let's use nltk to access some corpora

In [None]:
# On a Mac you may also need:
# xcode-select --install
# It installs the tools needed to build the C dependencies of regex, which nltk uses


#To install the nltk library :
# !pip install nltk

In [1]:
import nltk

In [2]:
# Download the nltk data files (corpora, models); 'all' fetches every package
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     /Users/csresearch/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     /Users/csresearch/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /Users/csresearch/nltk_data...
[nltk_data]    |   Package biocreative_ppi is already up-to-date!
[nltk_data]    | Downloading package brown to
[nltk_data]    |     /Users/csresearch/nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package brown_tei to
[nltk_data]    |     /Users/csresearch/nltk_data...
[nltk_data]    |   Package brown_tei is already up-to-date!
[nltk_data]    | Downloading package cess_cat to
[nltk_data]    |     /Users/csresearch/nltk_data...
[nltk_data]    |   Package cess_cat is alrea

[nltk_data]    |   Package timit is already up-to-date!
[nltk_data]    | Downloading package toolbox to
[nltk_data]    |     /Users/csresearch/nltk_data...
[nltk_data]    |   Package toolbox is already up-to-date!
[nltk_data]    | Downloading package treebank to
[nltk_data]    |     /Users/csresearch/nltk_data...
[nltk_data]    |   Package treebank is already up-to-date!
[nltk_data]    | Downloading package twitter_samples to
[nltk_data]    |     /Users/csresearch/nltk_data...
[nltk_data]    |   Package twitter_samples is already up-to-date!
[nltk_data]    | Downloading package udhr to
[nltk_data]    |     /Users/csresearch/nltk_data...
[nltk_data]    |   Package udhr is already up-to-date!
[nltk_data]    | Downloading package udhr2 to
[nltk_data]    |     /Users/csresearch/nltk_data...
[nltk_data]    |   Package udhr2 is already up-to-date!
[nltk_data]    | Downloading package unicode_samples to
[nltk_data]    |     /Users/csresearch/nltk_data...
[nltk_data]    |   Package unicode_sam

True

# Accessing the Brown Corpus

In [3]:
from nltk.corpus import brown
brown.readme()

'BROWN CORPUS\n\nA Standard Corpus of Present-Day Edited American\nEnglish, for use with Digital Computers.\n\nby W. N. Francis and H. Kucera (1964)\nDepartment of Linguistics, Brown University\nProvidence, Rhode Island, USA\n\nRevised 1971, Revised and Amplified 1979\n\nhttp://www.hit.uib.no/icame/brown/bcm.html\n\nDistributed with the permission of the copyright holder,\nredistribution permitted.\n'

In [4]:
brown.categories()

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

We can access the tokenized sentences like this:

In [5]:
sentences = brown.sents(categories='adventure')

In [6]:
sentences

[['Dan', 'Morgan', 'told', 'himself', 'he', 'would', 'forget', 'Ann', 'Turner', '.'], ['He', 'was', 'well', 'rid', 'of', 'her', '.'], ...]

In [7]:
# Join the tokens back into sentence strings (spacing around punctuation is not restored)
[' '.join(sent) for sent in sentences]

['Dan Morgan told himself he would forget Ann Turner .',
 'He was well rid of her .',
 "He certainly didn't want a wife who was fickle as Ann .",
 "If he had married her , he'd have been asking for trouble .",
 'But all of this was rationalization .',
 'Sometimes he woke up in the middle of the night thinking of Ann , and then could not get back to sleep .',
 'His plans and dreams had revolved around her so much and for so long that now he felt as if he had nothing .',
 "The easiest thing would be to sell out to Al Budd and leave the country , but there was a stubborn streak in him that wouldn't allow it .",
 'The best antidote for the bitterness and disappointment that poisoned him was hard work .',
 'He found that if he was tired enough at night , he went to sleep simply because he was too exhausted to stay awake .',
 'Each day he found himself thinking less often of Ann ; ;',
 'each day the hurt was a little duller , a little less poignant .',
 'He had plenty of work to do .',
 'Bec
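
Joining with a single space leaves a gap before every punctuation token, as the output above shows. A rough sketch of a fix, assuming only common sentence punctuation needs attention (the `detokenize` helper is hypothetical and ignores quotes, brackets, and similar cases):

```python
import re

def detokenize(tokens):
    """Join tokens with spaces, then remove the space before punctuation."""
    text = " ".join(tokens)
    return re.sub(r"\s+([.,;:!?])", r"\1", text)

print(detokenize(["He", "was", "well", "rid", "of", "her", "."]))
# He was well rid of her.
```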

# POS tagged sentences

In [8]:
tagged_sents= brown.tagged_sents(categories='humor')
tagged_sents

[[('It', 'PPS'), ('was', 'BEDZ'), ('among', 'IN'), ('these', 'DTS'), ('that', 'CS'), ('Hinkle', 'NP'), ('identified', 'VBD'), ('a', 'AT'), ('photograph', 'NN'), ('of', 'IN'), ('Barco', 'NP'), ('!', '.'), ('!', '.')], [('For', 'CS'), ('it', 'PPS'), ('seems', 'VBZ'), ('that', 'CS'), ('Barco', 'NP'), (',', ','), ('fancying', 'VBG'), ('himself', 'PPL'), ('a', 'AT'), ("ladies'", 'NNS$'), ('man', 'NN'), ('(', '('), ('and', 'CC'), ('why', 'WRB'), ('not', '*'), (',', ','), ('after', 'IN'), ('seven', 'CD'), ('marriages', 'NNS'), ('?', '.'), ('?', '.')], ...]

# Let's get the top nouns in humor and see their distribution

In [9]:
from collections import Counter

# Iterate over the tagged sentences and keep only the words with noun tags
noun = []
for sent in tagged_sents:
    noun += [w for w, tag in sent if tag in ['NN', 'NP', 'NNS']]

# Count the occurrences in our text
noun_counter = Counter(noun)

# Get the top 3
noun_counter.most_common(3)

[('time', 43), ('Mr.', 36), ('way', 28)]

# We can also use nltk's FreqDist

In [10]:
nouns_freq = nltk.FreqDist(noun)
nouns_freq

FreqDist({'time': 43, 'Mr.': 36, 'way': 28, 'things': 27, 'Arlene': 24, 'man': 21, 'years': 21, 'children': 20, 'day': 19, 'people': 19, ...})

In [11]:
type(nouns_freq)

nltk.probability.FreqDist
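
Raw counts become a distribution once they are divided by the total. The sketch below does this with a plain `Counter` on a small hand-made sample; the tagged tokens are hypothetical stand-ins for the `brown.tagged_sents()` output, not real corpus data.

```python
from collections import Counter

# Hypothetical hand-tagged tokens standing in for real corpus output.
tagged = [("time", "NN"), ("flies", "VBZ"), ("time", "NN"), ("way", "NN"), ("Mr.", "NP")]

# Keep only noun tags, as in the cell above.
noun_counts = Counter(w for w, tag in tagged if tag in {"NN", "NP", "NNS"})

# Relative frequency of each noun among all nouns.
total = sum(noun_counts.values())
relative = {w: c / total for w, c in noun_counts.items()}

print(relative)  # {'time': 0.5, 'way': 0.25, 'Mr.': 0.25}
```

`FreqDist` also has a `freq` method that returns this kind of relative frequency for a single sample.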

In [12]:
# number of original documents (file ids)
len(brown.fileids())

500

# Wordnet

In [13]:
from nltk.corpus import wordnet

Look up the synsets (sets of synonyms) of a word:

In [14]:
word_synsets= wordnet.synsets('hiking')
word_synsets

[Synset('hike.n.01'), Synset('hike.v.01'), Synset('hike.v.02')]

In [15]:
#We can extract the name of the synsets, definition and even examples
for synset in word_synsets:
    print('name:',synset.name())
    print('definition:',synset.definition())
    print('examples:',synset.examples())

name: hike.n.01
definition: a long walk usually for exercise or pleasure
examples: ['she enjoys a hike in her spare time']
name: hike.v.01
definition: increase
examples: ['The landlord hiked up the rents']
name: hike.v.02
definition: walk a long way, as for pleasure or physical exercise
examples: ['We were hiking in Colorado', 'hike the Rockies']


# Text Preprocessing

# Tokenization

Tokenization breaks down or splits textual data into smaller meaningful components.
- Sentence tokenization
- Word tokenization

# Tokenizers in nltk
- sent_tokenize
- RegexpTokenizer

Check the documentation for other tokenizers.

In [16]:
import nltk
from nltk.corpus import gutenberg

In [17]:
edgeworth = gutenberg.raw(fileids='edgeworth-parents.txt')
edgeworth[0:1000]

'[The Parent\'s Assistant, by Maria Edgeworth]\r\n\r\n\r\nTHE ORPHANS.\r\n\r\nNear the ruins of the castle of Rossmore, in Ireland, is a small cabin,\r\nin which there once lived a widow and her four children.  As long as she\r\nwas able to work, she was very industrious, and was accounted the best\r\nspinner in the parish; but she overworked herself at last, and fell ill,\r\nso that she could not sit to her wheel as she used to do, and was obliged\r\nto give it up to her eldest daughter, Mary.\r\n\r\nMary was at this time about twelve years old.  One evening she was\r\nsitting at the foot of her mother\'s bed spinning, and her little brothers\r\nand sisters were gathered round the fire eating their potatoes and milk\r\nfor supper.  "Bless them, the poor young creatures!" said the widow, who,\r\nas she lay on her bed, which she knew must be her deathbed, was thinking\r\nof what would become of her children after she was gone.  Mary stopped\r\nher wheel, for she was afraid that the nois

In [18]:
edgeworth_sent_tkn = nltk.sent_tokenize(edgeworth)
edgeworth_sent_tkn

["[The Parent's Assistant, by Maria Edgeworth]\r\n\r\n\r\nTHE ORPHANS.",
 'Near the ruins of the castle of Rossmore, in Ireland, is a small cabin,\r\nin which there once lived a widow and her four children.',
 'As long as she\r\nwas able to work, she was very industrious, and was accounted the best\r\nspinner in the parish; but she overworked herself at last, and fell ill,\r\nso that she could not sit to her wheel as she used to do, and was obliged\r\nto give it up to her eldest daughter, Mary.',
 'Mary was at this time about twelve years old.',
 "One evening she was\r\nsitting at the foot of her mother's bed spinning, and her little brothers\r\nand sisters were gathered round the fire eating their potatoes and milk\r\nfor supper.",
 '"Bless them, the poor young creatures!"',
 'said the widow, who,\r\nas she lay on her bed, which she knew must be her deathbed, was thinking\r\nof what would become of her children after she was gone.',
 'Mary stopped\r\nher wheel, for she was afraid that

In [19]:
print('Total sentences  {}'.format(len(edgeworth_sent_tkn)))
for sidx, s in enumerate(edgeworth_sent_tkn[0:3]):
    print(sidx,"::", s, '\n')

Total sentences  10096
0 :: [The Parent's Assistant, by Maria Edgeworth]


THE ORPHANS. 

1 :: Near the ruins of the castle of Rossmore, in Ireland, is a small cabin,
in which there once lived a widow and her four children. 

2 :: As long as she
was able to work, she was very industrious, and was accounted the best
spinner in the parish; but she overworked herself at last, and fell ill,
so that she could not sit to her wheel as she used to do, and was obliged
to give it up to her eldest daughter, Mary. 
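
The tokenized sentences still carry the line breaks (`\r\n`) of the raw file. A small sketch of one way to normalize them, using only built-in string methods (the helper name and the sample string are illustrative):

```python
def normalize_whitespace(sentence):
    """Collapse every run of whitespace, including line breaks, to one space."""
    return " ".join(sentence.split())

sample = "As long as she\r\nwas able to work,  she was very industrious"
print(normalize_whitespace(sample))
# As long as she was able to work, she was very industrious
```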



# Word Tokenization

Converts a sentence into word tokens. This is a typical step before stemming or lemmatizing.

- word_tokenize
- TreebankWordTokenizer. Based on the Penn Treebank and uses various regular expressions to tokenize the text.
- RegexpTokenizer

In [20]:
s = 'I just absolutely adore Denver and the Boulder area.'

In [21]:
nltk.word_tokenize(s)

['I',
 'just',
 'absolutely',
 'adore',
 'Denver',
 'and',
 'the',
 'Boulder',
 'area',
 '.']

In [22]:
word_regex= nltk.RegexpTokenizer(pattern=r'\w+', gaps=False)
word_regex.tokenize(s)

['I', 'just', 'absolutely', 'adore', 'Denver', 'and', 'the', 'Boulder', 'area']

We can also get the start and end indices of each token:

In [23]:
list(word_regex.span_tokenize(s))

[(0, 1),
 (2, 6),
 (7, 17),
 (18, 23),
 (24, 30),
 (31, 34),
 (35, 38),
 (39, 46),
 (47, 51)]

In [24]:
#Using the start and end indices to extract the words :
[s[st:en] for st,en in word_regex.span_tokenize(s)]

['I', 'just', 'absolutely', 'adore', 'Denver', 'and', 'the', 'Boulder', 'area']
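
For comparison, the same spans can be obtained with the standard library: `re.finditer` yields match objects whose `start()` and `end()` give the indices. This is a sketch of the idea, not a claim about how `span_tokenize` is implemented.

```python
import re

s = 'I just absolutely adore Denver and the Boulder area.'

# Each match of \w+ is one token; start()/end() give its position in s.
spans = [(m.start(), m.end()) for m in re.finditer(r'\w+', s)]
words = [s[start:end] for start, end in spans]

print(spans[:3])  # [(0, 1), (2, 6), (7, 17)]
print(words[:3])  # ['I', 'just', 'absolutely']
```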

There are other word tokenizer classes; check the documentation. To give you a flavour, here is WordPunctTokenizer, which splits a sentence into separate alphabetic and non-alphabetic tokens.

In [25]:
wordpunkt_tkn = nltk.WordPunctTokenizer()
wordpunkt_tkn.tokenize("He couldn't swim" )

['He', 'couldn', "'", 't', 'swim']
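
The split of `couldn't` into three tokens follows from the underlying pattern. The sketch below reproduces the result with plain `re`; the pattern `\w+|[^\w\s]+` is my understanding of what WordPunctTokenizer uses, so treat it as an assumption.

```python
import re

# Assumed pattern: runs of word characters, or runs of punctuation.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

tokens = WORD_PUNCT.findall("He couldn't swim")
print(tokens)  # ['He', 'couldn', "'", 't', 'swim']
```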

# RegexpTokenizer

In [26]:
s ="Price of a gallon milk is $3.50.  I'll buy 2. Thanks."

# Let's try the built-in word tokenizer first:
nltk.word_tokenize(s)

['Price',
 'of',
 'a',
 'gallon',
 'milk',
 'is',
 '$',
 '3.50',
 '.',
 'I',
 "'ll",
 'buy',
 '2',
 '.',
 'Thanks',
 '.']

In [27]:
# This pattern keeps a price such as $3.50 together as one token;
# note that it also emits whitespace tokens and drops other punctuation
sent_regex = nltk.tokenize.RegexpTokenizer(r'\w+|\$\d+\.\d+|\s+')

In [28]:
sent_regex.tokenize(s)

['Price',
 ' ',
 'of',
 ' ',
 'a',
 ' ',
 'gallon',
 ' ',
 'milk',
 ' ',
 'is',
 ' ',
 '$3.50',
 '  ',
 'I',
 'll',
 ' ',
 'buy',
 ' ',
 '2',
 ' ',
 'Thanks']
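
The output above contains whitespace tokens and splits `I'll` into `I` and `ll`. One possible refinement, shown here with plain `re` as a sketch, drops the whitespace alternative and allows an apostrophe inside a word. The pattern is an illustration for this sentence, not a general-purpose tokenizer.

```python
import re

s = "Price of a gallon milk is $3.50.  I'll buy 2. Thanks."

# Prices first, then words that may contain one internal apostrophe.
TOKEN = re.compile(r"\$\d+(?:\.\d+)?|\w+(?:'\w+)?")

tokens = TOKEN.findall(s)
print(tokens)
# ['Price', 'of', 'a', 'gallon', 'milk', 'is', '$3.50', "I'll", 'buy', '2', 'Thanks']
```

The order of the alternatives matters: the price pattern has to come before the generic word pattern, or `3` and `50` would be matched as separate words.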

# Text normalization or wrangling
Apart from tokenization, normalization typically involves:
- Cleaning
- Case conversion
- Spell correction
- Removing stop words
- Stemming
- Lemmatization

# Cleaning text
Remove any unnecessary tokens.

For example, when extracting text from HTML we don't care about the tags.
- Use regex or Beautiful Soup
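
As a dependency-free illustration of the same idea, the standard library's `html.parser` can collect the visible text while skipping tags and the contents of `script` and `style` elements. The class name, helper function, and sample HTML below are made up for this sketch.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

sample = ("<html><head><script>var x = 1;</script></head>"
          "<body><h1>World economy</h1><p>Goods &amp; services.</p></body></html>")
print(html_to_text(sample))  # World economy Goods & services.
```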

In [5]:
# example to get text from html
import requests
from bs4 import BeautifulSoup as bsp
response = requests.get('https://en.wikipedia.org/wiki/World_economy')
response.status_code

200

In [6]:
response.headers

{'Date': 'Thu, 29 Apr 2021 00:23:08 GMT', 'Vary': 'Accept-Encoding,Cookie,Authorization', 'Server': 'ATS/8.0.8', 'X-Content-Type-Options': 'nosniff', 'P3p': 'CP="See https://en.wikipedia.org/wiki/Special:CentralAutoLogin/P3P for more info."', 'Content-Language': 'en', 'X-Request-Id': '184a3a7b-d222-47ce-a07f-52256d4269c8', 'Last-Modified': 'Sat, 24 Apr 2021 15:25:53 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Content-Encoding': 'gzip', 'Age': '3697', 'X-Cache': 'cp2037 miss, cp2035 hit/3', 'X-Cache-Status': 'hit-front', 'Server-Timing': 'cache;desc="hit-front", host;desc="cp2035"', 'Strict-Transport-Security': 'max-age=106384710; includeSubDomains; preload', 'Report-To': '{ "group": "wm_nel", "max_age": 86400, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }', 'NEL': '{ "report_to": "wm_nel", "max_age": 86400, "failure_fraction": 0.05, "success_fraction": 0.0}', 'Pe

In [7]:
soupify = bsp(response.text, 'lxml')

In [8]:
soupify.get_text()[5039:7861]

'PP)\n\nPeak year\n\nNumber of countries\n\nMembers of G20 economies and/or largest in the group (Mutually exclusive)\n\n\nWorld\n\n141,962,059\n\n2021\n\n195\n\n\n\n\nEmerging and developing Asia\n\n46,770,009\n\n2021\n\n30\n\n\xa0China\xa0India\xa0Indonesia\xa0Malaysia\xa0Philippines\xa0Thailand\xa0Vietnam\n\n\nMajor advanced economies (G7)\n\n44,000,957\n\n2021\n\n7\n\n\xa0Canada\xa0France\xa0Germany\xa0Italy\xa0Japan\xa0United Kingdom\xa0United States\n\n\nOther advanced economies(advanced economies excluding the G7)\n\n15,882,382\n\n2021\n\n32\n\n\xa0Australia\xa0South Korea\xa0Netherlands\xa0Spain\xa0\xa0Switzerland\xa0Taiwan\n\n\nEmerging and developing Europe\n\n10,828,956\n\n2019\n\n16\n\n\xa0Poland\xa0Russia\xa0Turkey\n\n\nLatin America and the Caribbean\n\n10,212,082\n\n2019\n\n33\n\n\xa0Argentina\xa0Brazil\xa0Colombia\xa0Mexico\xa0Venezuela\n\n\nMiddle East and Central Asia\n\n9,946,269\n\n2021\n\n32\n\n\xa0Egypt\xa0Iran\xa0Pakistan\xa0Saudi Arabia\xa0United Arab Emirates\n

In [9]:
len('\n\n\nWorld economy - Wikipedia\ndocument.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );\n(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"World_economy","wgTitle":"World economy","wgCurRevisionId":882908028,"wgRevisionId":882908028,"wgArticleId":227630,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Webarchive template wayback links","All accuracy disputes","Articles with disputed statements from December 2016","Articles to be expanded from January 2019","All articles to be expanded","Articles using small message boxes","Wikipedia articles with GND identifiers","Economics catchphrases","World economy","Economic globalization"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"wgRelevantPageName":"World_economy","wgRelevantArticleId":227630,"wgRequestId":"XG3CNApAMFwAAG7O-igAAABS","wgCSPNonce":false,"wgIsProbablyEditable":true,"wgRelevantPageIsProbablyEditable":true,"wgRestrictionEdit":[],"wgRestrictionMove":[],"wgFlaggedRevsParams":{"tags":{}},"wgStableRevisionId":null,"wgCategoryTreePageCategoryOptions":"{\\"mode\\":0,\\"hideprefix\\":20,\\"showcount\\":true,\\"namespaces\\":false}","wgWikiEditorEnabledModules":[],"wgBetaFeaturesFeatures":[],"wgMediaViewerOnClick":true,"wgMediaViewerEnabledByDefault":true,"wgPopupsShouldSendModuleToUser":true,"wgPopupsConflictsWithNavPopupGadget":false,"wgVisualEditor":{"pageLanguageCode":"en","pageLanguageDir":"ltr","pageVariantFallb
acks":"en","usePageImages":true,"usePageDescriptions":true},"wgMFIsPageContentModelEditable":true,"wgMFEnableFontChanger":true,"wgMFDisplayWikibaseDescriptions":{"search":true,"nearby":true,"watchlist":true,"tagline":false},"wgRelatedArticles":null,"wgRelatedArticlesUseCirrusSearch":true,"wgRelatedArticlesOnlyUseCirrusSearch":false,"wgWMESchemaEditAttemptStepOversample":false,"wgPoweredByHHVM":true,"wgULSCurrentAutonym":"English","wgNoticeProject":"wikipedia","wgCentralNoticeCookiesToDelete":[],"wgCentralNoticeCategoriesUsingLegacy":["Fundraising","fundraising"],"wgWikibaseItemId":"Q473750","wgScoreNoteLanguages":{"arabic":"العربية","catalan":"català","deutsch":"Deutsch","english":"English","espanol":"español","italiano":"italiano","nederlands":"Nederlands","norsk":"norsk","portugues":"português","suomi":"suomi","svenska":"svenska","vlaams":"West-Vlams"},"wgScoreDefaultNoteLanguage":"nederlands","wgCentralAuthMobileDomain":false,"wgCodeMirrorEnabled":true,"wgVisualEditorToolbarScrollOffset":0,"wgVisualEditorUnsupportedEditParams":["undo","undoafter","veswitched"],"wgEditSubmitButtonLabelPublish":true,"oresWikiId":"enwiki","oresBaseUrl":"http://ores.discovery.wmnet:8081/","oresApiVersion":3});mw.loader.state({"ext.gadget.charinsert-styles":"ready","ext.globalCssJs.user.styles":"ready","ext.globalCssJs.site.styles":"ready","site.styles":"ready","noscript":"ready","user.styles":"ready","ext.globalCssJs.user":"ready","ext.globalCssJs.site":"ready","user":"ready","user.options":"ready","user.tokens":"loading","ext.cite.styles":"ready","mediawiki.legacy.shared":"ready","mediawiki.legacy.commonPrint":"ready","mediawiki.toc.styles":"ready","wikibase.client.init":"ready","ext.visualEditor.desktopArticleTarget.noscript":"ready","ext.uls.interlanguage":"ready","ext.wikimediaBadges":"ready","ext.3d.styles":"ready","mediawiki.skinning.interface":"ready","skins.vector.styles":"ready"});mw.loader.implement("user.tokens@0tffind",function($,jQuery,require,module){/*@nomin*/mw.user.t
okens.set({"editToken":"+\\\\","patrolToken":"+\\\\","watchToken":"+\\\\","csrfToken":"+\\\\"});\n});RLPAGEMODULES=["ext.cite.ux-enhancements","ext.scribunto.logs","site","mediawiki.page.startup","mediawiki.page.ready","mediawiki.toc","mediawiki.searchSuggest","ext.gadget.teahouse","ext.gadget.ReferenceTooltips","ext.gadget.watchlist-notice","ext.gadget.DRN-wizard","ext.gadget.charinsert","ext.gadget.refToolbar","ext.gadget.extra-toolbar-buttons","ext.gadget.switcher","ext.centralauth.centralautologin","mmv.head","mmv.bootstrap.autostart","ext.popups","ext.visualEditor.desktopArticleTarget.init","ext.visualEditor.targetLoader","ext.eventLogging","ext.wikimediaEvents","ext.navigationTiming","ext.uls.eventlogger","ext.uls.init","ext.uls.compactlinks","ext.uls.interface","ext.quicksurveys.init","ext.centralNotice.geoIP","ext.centralNotice.startUp","skins.vector.js"];mw.loader.load(RLPAGEMODULES);});\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n\n\n\n\n\nWorld economy \nFrom Wikipedia, the free encyclopedia \n Jump to navigation\nJump to search\nFor other uses, see World Economy (disambiguation).\nWorld economy\nAfrica\nAmericas\nCentral America\nNorth America\nSouth America\nAsia\nEast Asia\nEurope\nOceania\nvte\nThe world economy or global economy is the economy of the humans of the world, considered as the international exchange of goods and services that is expressed in monetary units of account.[1] In some contexts, the two terms are distinguished: the "international" or "global economy" being measured separately and distinguished from national economies while the "world economy" is simply an aggregate of the separate countries\' measurements. Beyond the minimum standard concerning value in production, use and exchange the definitions, representations, models and valuations of the world economy vary widely. 
It is inseparable from the geography and ecology of Earth.\nIt is common to limit questions of the world economy exclusively to human economic activity and the world economy is typically judged in monetary terms, even in cases in which there is no efficient market to help valuate certain goods or services, or in cases in which a lack of independent research or government cooperation makes establishing figures difficult. Typical examples are illegal drugs and other black market goods, which by any standard are a part of the world economy, but for which there is by definition no legal market of any kind.\nHowever, even in cases in which there is a clear and efficient market to establish a monetary value, economists do not typically use the current or official exchange rate to translate the monetary units of this market into a single unit for the world economy since exchange rates typically do not closely reflect worldwide value, for example in cases where the volume or price of transactions is closely regulated by the government.\n\n World share of GDP (PPP) (World Bank, 2011)[2]\nRather, market valuations in a local currency are typically translated to a single monetary unit using the idea of purchasing power. This is the method used below, which is used for estimating worldwide economic activity in terms of real United States dollars or euros. However, the world economy can be evaluated and expressed in many more ways. It is unclear, for example, how many of the world\'s 7.62 billion people have most of their economic activity reflected in these valuations.\nAccording to Maddison, until the middle of 19th century, global output was dominated by China and India. Waves of Industrial Revolution in Western Europe and Northern America shifted the shares to the Western Hemisphere. 
As of 2017, the following 15 countries or regions have reached an economy of at least US$2 trillion by GDP in nominal or PPP terms: Brazil, China, India, Germany, France, Indonesia, Italy, Japan, South Korea, Mexico, Russia, Turkey, the United Kingdom, the United States and the European Union')

7861

In [10]:
soupify.get_text()



In [11]:
len('\n\n\nWorld economy - Wikipedia\ndocument.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );\n(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"World_economy","wgTitle":"World economy","wgCurRevisionId":882908028,"wgRevisionId":882908028,"wgArticleId":227630,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Webarchive template wayback links","All accuracy disputes","Articles with disputed statements from December 2016","Articles to be expanded from January 2019","All articles to be expanded","Articles using small message boxes","Wikipedia articles with GND identifiers","Economics catchphrases","World economy","Economic globalization"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"wgRelevantPageName":"World_economy","wgRelevantArticleId":227630,"wgRequestId":"XG3CNApAMFwAAG7O-igAAABS","wgCSPNonce":false,"wgIsProbablyEditable":true,"wgRelevantPageIsProbablyEditable":true,"wgRestrictionEdit":[],"wgRestrictionMove":[],"wgFlaggedRevsParams":{"tags":{}},"wgStableRevisionId":null,"wgCategoryTreePageCategoryOptions":"{\\"mode\\":0,\\"hideprefix\\":20,\\"showcount\\":true,\\"namespaces\\":false}","wgWikiEditorEnabledModules":[],"wgBetaFeaturesFeatures":[],"wgMediaViewerOnClick":true,"wgMediaViewerEnabledByDefault":true,"wgPopupsShouldSendModuleToUser":true,"wgPopupsConflictsWithNavPopupGadget":false,"wgVisualEditor":{"pageLanguageCode":"en","pageLanguageDir":"ltr","pageVariantFallb
acks":"en","usePageImages":true,"usePageDescriptions":true},"wgMFIsPageContentModelEditable":true,"wgMFEnableFontChanger":true,"wgMFDisplayWikibaseDescriptions":{"search":true,"nearby":true,"watchlist":true,"tagline":false},"wgRelatedArticles":null,"wgRelatedArticlesUseCirrusSearch":true,"wgRelatedArticlesOnlyUseCirrusSearch":false,"wgWMESchemaEditAttemptStepOversample":false,"wgPoweredByHHVM":true,"wgULSCurrentAutonym":"English","wgNoticeProject":"wikipedia","wgCentralNoticeCookiesToDelete":[],"wgCentralNoticeCategoriesUsingLegacy":["Fundraising","fundraising"],"wgWikibaseItemId":"Q473750","wgScoreNoteLanguages":{"arabic":"العربية","catalan":"català","deutsch":"Deutsch","english":"English","espanol":"español","italiano":"italiano","nederlands":"Nederlands","norsk":"norsk","portugues":"português","suomi":"suomi","svenska":"svenska","vlaams":"West-Vlams"},"wgScoreDefaultNoteLanguage":"nederlands","wgCentralAuthMobileDomain":false,"wgCodeMirrorEnabled":true,"wgVisualEditorToolbarScrollOffset":0,"wgVisualEditorUnsupportedEditParams":["undo","undoafter","veswitched"],"wgEditSubmitButtonLabelPublish":true,"oresWikiId":"enwiki","oresBaseUrl":"http://ores.discovery.wmnet:8081/","oresApiVersion":3});mw.loader.state({"ext.gadget.charinsert-styles":"ready","ext.globalCssJs.user.styles":"ready","ext.globalCssJs.site.styles":"ready","site.styles":"ready","noscript":"ready","user.styles":"ready","ext.globalCssJs.user":"ready","ext.globalCssJs.site":"ready","user":"ready","user.options":"ready","user.tokens":"loading","ext.cite.styles":"ready","mediawiki.legacy.shared":"ready","mediawiki.legacy.commonPrint":"ready","mediawiki.toc.styles":"ready","wikibase.client.init":"ready","ext.visualEditor.desktopArticleTarget.noscript":"ready","ext.uls.interlanguage":"ready","ext.wikimediaBadges":"ready","ext.3d.styles":"ready","mediawiki.skinning.interface":"ready","skins.vector.styles":"ready"});mw.loader.implement("user.tokens@0tffind",function($,jQuery,require,module){/*@nomin*/mw.user.t
okens.set({"editToken":"+\\\\","patrolToken":"+\\\\","watchToken":"+\\\\","csrfToken":"+\\\\"});\n});RLPAGEMODULES=["ext.cite.ux-enhancements","ext.scribunto.logs","site","mediawiki.page.startup","mediawiki.page.ready","mediawiki.toc","mediawiki.searchSuggest","ext.gadget.teahouse","ext.gadget.ReferenceTooltips","ext.gadget.watchlist-notice","ext.gadget.DRN-wizard","ext.gadget.charinsert","ext.gadget.refToolbar","ext.gadget.extra-toolbar-buttons","ext.gadget.switcher","ext.centralauth.centralautologin","mmv.head","mmv.bootstrap.autostart","ext.popups","ext.visualEditor.desktopArticleTarget.init","ext.visualEditor.targetLoader","ext.eventLogging","ext.wikimediaEvents","ext.navigationTiming","ext.uls.eventlogger","ext.uls.init","ext.uls.compactlinks","ext.uls.interface","ext.quicksurveys.init","ext.centralNotice.geoIP","ext.centralNotice.startUp","skins.vector.js"];mw.loader.load(RLPAGEMODULES);});\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n\n\n\n\n\nWorld economy \nFrom Wikipedia, the free encyclopedia \n Jump to navigation\nJump to search\nFor other uses, see World Economy (disambiguation).\nWorld economy\nAfrica\nAmericas\nCentral America\nNorth America\nSouth America\nAsia\nEast Asia\nEurope\nOceania\nvte\n')

5160

# Working on a corpus
# Tokenization

In [12]:
corpus = ['Meet Google Fi, a different kind of phone plan @@plan', '*Simpler* pricing and smarter coverage. It has unlimited call and text at $20!']
corpus


['Meet Google Fi, a different kind of phone plan @@plan',
 '*Simpler* pricing and smarter coverage. It has unlimited call and text at $20!']

In [14]:
import nltk
sent_tokens = []
for doc in corpus:
    sent_tokens.append(nltk.sent_tokenize(doc))
sent_tokens

[['Meet Google Fi, a different kind of phone plan @@plan'],
 ['*Simpler* pricing and smarter coverage.',
  'It has unlimited call and text at $20!']]

In [15]:
words_tokens= []
for doc in corpus:
    sent_tokens= nltk.sent_tokenize(doc)
    words_tokens.append([nltk.word_tokenize(sent) for sent in sent_tokens])
        

In [16]:
words_tokens

[[['Meet',
   'Google',
   'Fi',
   ',',
   'a',
   'different',
   'kind',
   'of',
   'phone',
   'plan',
   '@',
   '@',
   'plan']],
 [['*', 'Simpler', '*', 'pricing', 'and', 'smarter', 'coverage', '.'],
  ['It', 'has', 'unlimited', 'call', 'and', 'text', 'at', '$', '20', '!']]]

In [17]:
import re
import string
pattern = '[{}]'.format(re.escape(string.punctuation))
pattern

'[!"\\#\\$%\\&\'\\(\\)\\*\\+,\\-\\./:;<=>\\?@\\[\\\\\\]\\^_`\\{\\|\\}\\~]'

In [26]:
# Build a regex that removes punctuation
# and apply it to the tokens of the first sentence, words_tokens[0][0]
punc_regex = re.compile(pattern)
clean_sent = list(filter(None, [punc_regex.sub('', token) for token in words_tokens[0][0]]))

In [27]:
clean_sent

['Meet',
 'Google',
 'Fi',
 'a',
 'different',
 'kind',
 'of',
 'phone',
 'plan',
 'plan']
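
An alternative to the regex is `str.translate` with a table that deletes every character in `string.punctuation`. The sketch below applies it to the same token list and yields the same result; it is offered as an equivalent option, not a replacement for the cell above.

```python
import string

tokens = ['Meet', 'Google', 'Fi', ',', 'a', 'different', 'kind',
          'of', 'phone', 'plan', '@', '@', 'plan']

# Translation table that maps every punctuation character to None (deletion).
table = str.maketrans('', '', string.punctuation)

# Strip punctuation from each token, then drop tokens that became empty.
stripped = [t.translate(table) for t in tokens]
clean = [t for t in stripped if t]

print(clean)
# ['Meet', 'Google', 'Fi', 'a', 'different', 'kind', 'of', 'phone', 'plan', 'plan']
```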

# Removing stop words

Stop words are very frequent words such as a, the, am that carry little meaning on their own.

In [28]:
stopwords = nltk.corpus.stopwords.words('english')
# The stop-word list is lowercase, so compare lowercased tokens
stop_clean_sent = [w for w in clean_sent if w.lower() not in stopwords]
print(stopwords)
stop_clean_sent

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

['Meet', 'Google', 'Fi', 'different', 'kind', 'phone', 'plan', 'plan']

# Stemming

Stemming normalizes a word to its base or root form.

In [29]:
from nltk.stem import PorterStemmer
pstemmer = PorterStemmer()

In [30]:
pstemmer.stem('helped'), pstemmer.stem('helping')

('help', 'help')

In [37]:
pstemmer.stem('strange')

'strang'

In [38]:
from nltk.stem import LancasterStemmer
ls_stemmer = LancasterStemmer()
ls_stemmer.stem('strange')

'strange'

# Regex based stemmer

In [39]:
# Here we have a regex for words ending with ed, ing or es
from nltk.stem import RegexpStemmer
regex_stemmer = RegexpStemmer(r'ed$|ing$|es$', min=4)

In [40]:
regex_stemmer.stem('played'), regex_stemmer.stem('apples')

('play', 'appl')
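
What the regex stemmer does can be approximated in a few lines of plain Python: leave words shorter than the minimum length alone, otherwise delete the matching suffix. This is a sketch of the behaviour as I understand it, with a hypothetical function name, and it also shows how easily such a rule over-stems.

```python
import re

SUFFIX_RE = re.compile(r"ed$|ing$|es$")

def regex_stem(word, min_length=4):
    """Remove a trailing ed/ing/es unless the word is shorter than min_length."""
    if len(word) < min_length:
        return word
    return SUFFIX_RE.sub("", word)

print(regex_stem("played"), regex_stem("apples"))  # play appl
print(regex_stem("red"), regex_stem("need"))       # red ne
```

`red` survives only because of the length check, while `need` is cut to `ne`, which is why rule-based stemmers need carefully designed conditions.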

# Lemmatization
Lemmatization maps a word to its root form as found in the dictionary.

In [41]:
from nltk.stem import WordNetLemmatizer
wnetl = WordNetLemmatizer()

In [42]:
# noun
wnetl.lemmatize('buses', 'n')

'bus'

In [43]:
# verb
wnetl.lemmatize('running', 'v'), wnetl.lemmatize('ate', 'v')

('run', 'eat')

In [44]:
# adjective
wnetl.lemmatize('easier', 'a')

'easy'

Use the right part of speech: with the wrong tag the word is returned unchanged.

In [45]:
wnetl.lemmatize('ate','n')

'ate'
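
In practice the part of speech usually comes from a POS tagger rather than being typed by hand. A minimal sketch of a mapping helper, assuming Penn Treebank style tags as input; the function name is hypothetical, and the letters `n`, `v`, `a`, `r` are the ones WordNet uses for noun, verb, adjective, and adverb.

```python
def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    if tag.startswith("J"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("V"):
        return "v"  # verb (VB, VBD, VBG, ...)
    if tag.startswith("R"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # noun, and fallback for everything else

for tag in ["NNS", "VBD", "JJR", "RB", "CD"]:
    print(tag, "->", penn_to_wordnet_pos(tag))
```

The returned letter can then be passed as the second argument to a call such as `wnetl.lemmatize(word, pos)`.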

# Aside: getting bash output into Python

In [46]:
%%bash --out out
curl -s http://www.gutenberg.org/cache/epub/18674/pg18674.txt

In [47]:
out
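
Outside a notebook, where the `%%bash` cell magic is not available, the standard-library `subprocess` module can capture a command's output in a similar way. The command below is a harmless placeholder (it runs the current Python interpreter) rather than the `curl` call above, so the sketch works without network access.

```python
import subprocess
import sys

# Run a child process and capture its standard output as text.
result = subprocess.run(
    [sys.executable, "-c", "print('hello from a subprocess')"],
    capture_output=True,  # collect stdout and stderr instead of printing them
    text=True,            # decode bytes to str
    check=True,           # raise CalledProcessError on a non-zero exit code
)

out_text = result.stdout
print(out_text)
```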

