## Practice notebook for data processing in NLP: Lemmatizing and stopword removal

**Stopwords are common words such as "and", "I", "you", and "so" that add little value to tasks like sentiment analysis. We therefore remove them from the corpus.**

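**Lemmatizing reduces the inflected forms of a word to its dictionary form (the lemma). For example, "come", "coming", and "came" all map to "come". Note that NLTK's `WordNetLemmatizer` treats every word as a noun unless a part-of-speech tag is passed, so verb forms like these are only reduced when `pos="v"` is given.**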

**The difference between stemming and lemmatizing is that lemmatizing returns a meaningful dictionary word, while stemming returns only the stem of the word, which may not be a real word.**
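
To make the contrast concrete, here is a toy sketch in plain Python. The suffix list and the lookup table are made up for illustration. They are not NLTK's Porter stemmer or WordNet, only a stand-in that shows why a stem can be a non-word while a lemma is a dictionary entry.

```python
# Toy illustration only: hypothetical rules, not NLTK's algorithms.
SUFFIXES = ("ing", "ies", "ed", "es", "s")  # assumed suffix list, longest first

# Assumed mini-dictionary mapping inflected forms to their lemmas
LEMMA_TABLE = {"coming": "come", "came": "come", "studies": "study"}


def toy_stem(word):
    """Chop the first matching suffix, whether or not a real word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the table; unknown words are returned unchanged."""
    return LEMMA_TABLE.get(word, word)


for w in ("coming", "came", "studies"):
    print(w, "-> stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

The toy stemmer turns "coming" into "com" and "studies" into "stud", neither of which is a word, while the lookup returns "come" and "study".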

## 1. Importing WordNetLemmatizer for lemmatizing and stopwords for removing common words

In [1]:
import nltk

In [6]:
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
lemmatizing = WordNetLemmatizer()

## 2. Importing information about NLP from Wikipedia

In [3]:
Corpus= """Symbolic NLP (1950s - early 1990s)
The premise of symbolic NLP is well-summarized by John Searle's Chinese room experiment: 
Given a collection of rules (e.g., a Chinese phrasebook, with questions and matching answers), 
the computer emulates natural language understanding (or other NLP tasks) 
by applying those rules to the data it is confronted with.

1950s: The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English.
The authors claimed that within three or five years, machine translation would be a solved problem.
[2] However, real progress was much slower, and after the ALPAC report in 1966, 
which found that ten-year-long research had failed to fulfill the expectations,
funding for machine translation was dramatically reduced. 
Little further research in machine translation was conducted until 
the late 1980s when the first statistical machine translation systems were developed.
1960s: Some notably successful natural language processing systems developed in the 1960s were SHRDLU, 
a natural language system working in restricted "blocks worlds" with restricted vocabularies, and ELIZA, 
a simulation of a Rogerian psychotherapist, written by Joseph Weizenbaum between 1964 and 1966. 
Using almost no information about human thought or emotion, ELIZA sometimes provided a startlingly human-like interaction. 
When the "patient" exceeded the very small knowledge base, ELIZA might provide a generic response, for example,
responding to "My head hurts" with "Why do you say your head hurts?".
1970s: During the 1970s, many programmers began to write "conceptual ontologies", 
which structured real-world information into computer-understandable data. Examples are MARGIE (Schank, 1975), 
SAM (Cullingford, 1978), PAM (Wilensky, 1978), TaleSpin (Meehan, 1976), QUALM (Lehnert, 1977), Politics (Carbonell, 1979), and Plot Units (Lehnert 1981). During this time, the first many chatterbots were written (e.g., PARRY).
1980s: The 1980s and early 1990s mark the hey-day of symbolic methods in NLP. 
Focus areas of the time included research on rule-based parsing 
(e.g., the development of HPSG as a computational operationalization of generative grammar), 
morphology (e.g., two-level morphology[3]), semantics (e.g., Lesk algorithm),
reference (e.g., within Centering Theory[4]) and other areas of natural language understanding 
(e.g., in the Rhetorical Structure Theory). Other lines of research were continued, 
e.g., the development of chatterbots with Racter and Jabberwacky. 
An important development (that eventually led to the statistical turn in the 1990s) was the rising importance of 
quantitative evaluation in this period.[5]Statistical NLP (1990s - 2010s)
Up to the 1980s, most natural language processing systems were based on complex sets of hand-written rules. 
Starting in the late 1980s, however, there was a revolution in natural language processing with the 
introduction of machine learning algorithms for language processing. 
This was due to both the steady increase in computational power (see Moore's law)
and the gradual lessening of the dominance of Chomskyan theories of linguistics (e.g. transformational grammar), 
whose theoretical underpinnings discouraged the sort of corpus linguistics that underlies the machine-learning approach to language processing.[6]

1990s: Many of the notable early successes on statistical methods in NLP occurred 
in the field of machine translation, due especially to work at IBM Research. 
These systems were able to take advantage of existing multilingual textual corpora that had been produced by 
the Parliament of Canada and the European Union as a result of laws calling for the translation of all governmental proceedings
into all official languages of the corresponding systems of government. 
However, most other systems depended on corpora specifically developed for the tasks implemented by these systems, 
which was (and often continues to be) a major limitation in the success of these systems. 
As a result, a great deal of research has gone into methods of more effectively learning from limited amounts of data.
2000s: With the growth of the web, increasing amounts of raw (unannotated) language data has become available since the 
mid-1990s.
Research has thus increasingly focused on unsupervised and semi-supervised learning algorithms. 
Such algorithms can learn from data that has not been hand-annotated with the desired answers or using a 
combination of annotated and non-annotated data. Generally, this task is much more difficult than supervised learning,
and typically produces less accurate results for a given amount of input data. 
However, there is an enormous amount of non-annotated data available 
(including, among other things, the entire content of the World Wide Web), 
which can often make up for the inferior results if the algorithm used has a low enough time complexity to be practical.
Neural NLP (present)In the 2010s, 
representation learning and deep neural network-style machine learning methods became widespread in natural language processing,
due in part to a flurry of results showing that such techniques[7][8] can achieve state-of-the-art results in 
many natural language tasks, for example in language modeling,[9] parsing,[10][11] and many others."""



## 3. Converting the entire text into a list of sentences

In [4]:
sentenses = nltk.sent_tokenize(Corpus)
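
`nltk.sent_tokenize` relies on NLTK's pre-trained Punkt model rather than a simple split on punctuation. The sketch below uses only the standard `re` module and a made-up two-sentence sample to show why a naive split is not enough: it breaks after the abbreviation "e.g." as if it ended a sentence.

```python
import re

# Hypothetical sample with an abbreviation in the middle of a sentence
sample = ("Chomskyan theories (e.g. transformational grammar) lost dominance. "
          "Machine learning took over.")

# Naive rule: split on whitespace that follows '.', '!' or '?'
naive_sentences = re.split(r"(?<=[.!?])\s+", sample)

for piece in naive_sentences:
    print(repr(piece))
# The sample has two sentences, but the naive rule returns three pieces
# because it also splits after "e.g."
```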

## 4. Lemmatizing the sentences and removing the stopwords

**Steps:**

1. Convert each sentence into a list of words
2. Keep only the words that are not in the stopword list
3. Lemmatize the remaining words
4. Join the words again to form a sentence

In [10]:
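# Build the stopword set once instead of once per word
stop_words = set(stopwords.words('english'))
for i in range(len(sentenses)):
    words_in_sentenses = nltk.word_tokenize(sentenses[i])
    # Drop stopwords, then lemmatize the words that remain
    words_in_sentenses = [lemmatizing.lemmatize(word) for word in words_in_sentenses if word not in stop_words]
    # Join the words back into a single string
    sentenses[i] = ' '.join(words_in_sentenses)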
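
One thing to watch in the loop above: the membership test is case-sensitive, and NLTK's English stopword list is lowercase, so a capitalized "The" at the start of a sentence passes through and still appears in the printed sentences. A minimal sketch of a case-insensitive check follows. The small `stop_words` set here is a hand-written stand-in for `stopwords.words('english')`, and the tokens are typed in by hand rather than produced by `nltk.word_tokenize`.

```python
# Hand-written stand-in for set(stopwords.words('english')); an assumption for illustration
stop_words = {"the", "in", "of", "more", "than", "into"}

# Tokens typed by hand, roughly as a word tokenizer might return them
tokens = ["The", "Georgetown", "experiment", "in", "1954", "involved",
          "translation", "of", "Russian", "sentences", "into", "English"]

# Case-sensitive test, as in the loop above: "The" is kept
case_sensitive = [t for t in tokens if t not in stop_words]

# Case-insensitive test: compare the lowercased token instead
case_insensitive = [t for t in tokens if t.lower() not in stop_words]

print(case_sensitive[0])    # The
print(case_insensitive[0])  # Georgetown
```

Lowercasing only for the comparison keeps the original capitalization of the words that survive, such as "Georgetown" and "Russian".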

In [11]:
sentenses

["Symbolic NLP ( 1950s - early 1990s ) The premise symbolic NLP well-summarized John Searle 's Chinese room experiment : Given collection rule ( e.g . , Chinese phrasebook , question matching answer ) , computer emulates natural language understanding ( NLP task ) applying rule data confronted .",
 '1950s : The Georgetown experiment 1954 involved fully automatic translation sixty Russian sentence English .',
 'The author claimed within three five year , machine translation would solved problem .',
 '[ 2 ] However , real progress much slower , ALPAC report 1966 , found ten-year-long research failed fulfill expectation , funding machine translation dramatically reduced .',
 'Little research machine translation conducted late 1980s first statistical machine translation system developed .',
 '1960s : Some notably successful natural language processing system developed 1960s SHRDLU , natural language system working restricted `` block world `` restricted vocabulary , ELIZA , simulation Roge