**Data preprocessing in Python**
Preprocessing refers to the transformations applied to data before it is fed
to an algorithm. It converts raw data into a clean data set: data gathered
from different sources arrives in a raw format that is not suitable for
analysis as it stands.

NLP techniques covered in this notebook:
1. Tokenizing text with NLTK
2. Removing stop words with NLTK
3. Stemming words with NLTK
4. Lemmatization with NLTK
5. Removing punctuation and part-of-speech tagging with NLTK
6. TF-IDF vectorization with scikit-learn


In [None]:
!pip install nltk -U
!pip install bs4 -U

In [5]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [6]:
para = 'In the magical world of Harry Potter, Hogwarts School of Witchcraft and Wizardry stands as a beacon of hope and learning for young witches and wizards.'
print(para)

In the magical world of Harry Potter, Hogwarts School of Witchcraft and Wizardry stands as a beacon of hope and learning for young witches and wizards.


In [7]:
para.split()

['In',
 'the',
 'magical',
 'world',
 'of',
 'Harry',
 'Potter,',
 'Hogwarts',
 'School',
 'of',
 'Witchcraft',
 'and',
 'Wizardry',
 'stands',
 'as',
 'a',
 'beacon',
 'of',
 'hope',
 'and',
 'learning',
 'for',
 'young',
 'witches',
 'and',
 'wizards.']
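
Splitting on whitespace leaves punctuation attached to the neighbouring word ('Potter,' and 'wizards.' above). The sketch below uses only the standard-library `re` module to separate words from punctuation. It is a rough approximation for illustration, not how NLTK's `word_tokenize` is implemented, and `simple_tokenize` is a name made up for this example.

```python
import re

para = 'In the magical world of Harry Potter, Hogwarts School of Witchcraft and Wizardry stands as a beacon of hope and learning for young witches and wizards.'

def simple_tokenize(text):
    # \w+ matches a run of word characters (letters, digits, underscore);
    # [^\w\s] matches a single character that is neither a word character
    # nor whitespace, so each punctuation mark becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

whitespace_tokens = para.split()   # punctuation stays glued to words
tokens = simple_tokenize(para)     # punctuation becomes separate tokens

print(len(whitespace_tokens), len(tokens))
print(tokens[5:9])
```

For this sentence the regular expression produces two more tokens than `split()`, one for the comma and one for the full stop. Real text with contractions, abbreviations or URLs needs a proper tokenizer.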

In [9]:
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

In [14]:
# Split the paragraph into sentences and show the first one
sent = sent_tokenize(para)
sent[0]

'In the magical world of Harry Potter, Hogwarts School of Witchcraft and Wizardry stands as a beacon of hope and learning for young witches and wizards.'

In [16]:
words = word_tokenize(para)
words

['In',
 'the',
 'magical',
 'world',
 'of',
 'Harry',
 'Potter',
 ',',
 'Hogwarts',
 'School',
 'of',
 'Witchcraft',
 'and',
 'Wizardry',
 'stands',
 'as',
 'a',
 'beacon',
 'of',
 'hope',
 'and',
 'learning',
 'for',
 'young',
 'witches',
 'and',
 'wizards',
 '.']

In [17]:
from nltk.corpus import stopwords
swords = stopwords.words('english')
swords

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 ...]

In [18]:
# Case-sensitive lookup: 'In' survives because the stop-word list is all lowercase
word_without_swords = [word for word in words if word not in swords]
word_without_swords

['In',
 'magical',
 'world',
 'Harry',
 'Potter',
 ',',
 'Hogwarts',
 'School',
 'Witchcraft',
 'Wizardry',
 'stands',
 'beacon',
 'hope',
 'learning',
 'young',
 'witches',
 'wizards',
 '.']

In [19]:
# Lowercase each token before the lookup so that 'In' is removed as well
x = [word for word in words if word.lower() not in swords]
x

['magical',
 'world',
 'Harry',
 'Potter',
 ',',
 'Hogwarts',
 'School',
 'Witchcraft',
 'Wizardry',
 'stands',
 'beacon',
 'hope',
 'learning',
 'young',
 'witches',
 'wizards',
 '.']
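
The difference between the two cells above comes down to case handling. The sketch below reproduces it without NLTK, assuming a small hand-picked stop-word set: `STOP_WORDS` and `remove_stop_words` are hypothetical names used only here, and the set is not NLTK's list. A `set` is used because membership tests on it do not slow down as the list of stop words grows.

```python
# Hypothetical subset of stop words, for illustration only
STOP_WORDS = {"in", "the", "of", "and", "a"}

def remove_stop_words(tokens, stop_words=STOP_WORDS, ignore_case=True):
    """Return the tokens that are not stop words.

    With ignore_case=True each token is lowercased before the lookup,
    so capitalised stop words such as 'In' are removed too.
    """
    if ignore_case:
        return [tok for tok in tokens if tok.lower() not in stop_words]
    return [tok for tok in tokens if tok not in stop_words]

sample = ["In", "the", "magical", "world", "of", "Harry", "Potter"]

case_sensitive = remove_stop_words(sample, ignore_case=False)
case_insensitive = remove_stop_words(sample)

print(case_sensitive)
print(case_insensitive)
```

The original tokens are returned with their casing intact; only the comparison is done in lowercase.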

In [21]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
ps.stem('working')

'work'

In [24]:
# Stem every remaining token; the output shows that PorterStemmer also lowercases
y = [ps.stem(word) for word in x]
y

['magic',
 'world',
 'harri',
 'potter',
 ',',
 'hogwart',
 'school',
 'witchcraft',
 'wizardri',
 'stand',
 'beacon',
 'hope',
 'learn',
 'young',
 'witch',
 'wizard',
 '.']

In [26]:
from nltk.stem import WordNetLemmatizer
wnl  = WordNetLemmatizer()
wnl.lemmatize('working', pos = 'v')

'work'

In [32]:
print(ps.stem('went'))
print(wnl.lemmatize('went',pos = 'v'))

went
go
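
The 'went' example shows the core difference: a stemmer strips suffixes by rule, while a lemmatizer looks the word up in a vocabulary and can therefore handle irregular forms. The toy sketch below illustrates that idea only. The suffix list and the lookup table are hypothetical and tiny, and neither function reflects the actual Porter algorithm or WordNet.

```python
# Hypothetical suffix rules, far simpler than the Porter algorithm
SUFFIXES = ("ing", "es", "s")

# Hypothetical stand-in for a dictionary such as WordNet
IRREGULAR_FORMS = {"went": "go", "was": "be", "better": "good"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the table; fall back to the word itself."""
    word = word.lower()
    return IRREGULAR_FORMS.get(word, word)

for w in ("working", "went"):
    print(w, "->", toy_stem(w), "/", toy_lemmatize(w))
```

Rule-based stripping cannot turn 'went' into 'go' because the two share no suffix pattern, whereas a lookup can. The price is that a lookup only knows the words in its vocabulary.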


In [34]:
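# Every token is treated as a verb here, so tokens with no verb reading
# (for example 'wizards') come back unchanged
z = [wnl.lemmatize(word, pos='v') for word in x]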
z

['magical',
 'world',
 'Harry',
 'Potter',
 ',',
 'Hogwarts',
 'School',
 'Witchcraft',
 'Wizardry',
 'stand',
 'beacon',
 'hope',
 'learn',
 'young',
 'witch',
 'wizards',
 '.']

In [35]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [37]:
punct = set(string.punctuation)  # a set avoids accidental substring matches against the string
t = [word for word in words if word not in punct]
t

['In',
 'the',
 'magical',
 'world',
 'of',
 'Harry',
 'Potter',
 'Hogwarts',
 'School',
 'of',
 'Witchcraft',
 'and',
 'Wizardry',
 'stands',
 'as',
 'a',
 'beacon',
 'of',
 'hope',
 'and',
 'learning',
 'for',
 'young',
 'witches',
 'and',
 'wizards']

In [38]:
from nltk import pos_tag
pos_tag(t)

[('In', 'IN'),
 ('the', 'DT'),
 ('magical', 'JJ'),
 ('world', 'NN'),
 ('of', 'IN'),
 ('Harry', 'NNP'),
 ('Potter', 'NNP'),
 ('Hogwarts', 'NNP'),
 ('School', 'NNP'),
 ('of', 'IN'),
 ('Witchcraft', 'NNP'),
 ('and', 'CC'),
 ('Wizardry', 'NNP'),
 ('stands', 'VBZ'),
 ('as', 'IN'),
 ('a', 'DT'),
 ('beacon', 'NN'),
 ('of', 'IN'),
 ('hope', 'NN'),
 ('and', 'CC'),
 ('learning', 'NN'),
 ('for', 'IN'),
 ('young', 'JJ'),
 ('witches', 'NNS'),
 ('and', 'CC'),
 ('wizards', 'NNS')]
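
These tags can be used to choose the `pos` argument for the lemmatizer instead of passing `pos='v'` for every token. `pos_tag` returns Penn Treebank tags, while `WordNetLemmatizer.lemmatize` expects one of WordNet's single-letter codes ('n', 'v', 'a', 'r'). The sketch below shows one common way to bridge the two; `penn_to_wordnet` is a hypothetical helper name, and defaulting to noun for everything else is an assumption of this example.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech code."""
    if tag.startswith("J"):   # JJ, JJR, JJS: adjectives
        return "a"
    if tag.startswith("V"):   # VB, VBD, VBG, VBN, VBP, VBZ: verbs
        return "v"
    if tag.startswith("R"):   # RB, RBR, RBS: adverbs
        return "r"
    return "n"                # assumed default: treat everything else as a noun

# A few (word, tag) pairs taken from the pos_tag output above
tagged = [("stands", "VBZ"), ("young", "JJ"), ("witches", "NNS"), ("wizards", "NNS")]

wordnet_pos = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(wordnet_pos)

# Each pair could then be passed on as wnl.lemmatize(word, pos=pos)
```

With this mapping 'wizards' would be lemmatized as a noun rather than as a verb.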

In [40]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
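# fit_transform expects an iterable of documents; passing the token list t
# makes every token its own one-word document, hence one row per token
v = tfidf.fit_transform(t)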
v.shape

(26, 21)
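
The shape (26, 21) follows from how the input is interpreted: 26 one-word documents and 21 distinct terms (the token 'a' is ignored because the default tokenizer keeps only tokens of two or more characters). The sketch below re-implements the calculation with the standard library to make that visible. It assumes scikit-learn's documented defaults (lowercasing, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalisation of each row); `tfidf_rows` is a hypothetical name and this is not scikit-learn's own code.

```python
import math
import re
from collections import Counter

# Tokens of two or more word characters, mirroring the assumed default pattern
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tfidf_rows(docs):
    """Return (vocabulary, rows) with one dense TF-IDF row per document."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in row))
        # L2-normalise; an all-zero row (no usable token) is left as it is
        rows.append([w / norm for w in row] if norm else row)
    return vocab, rows

t = ("In the magical world of Harry Potter Hogwarts School of Witchcraft and "
     "Wizardry stands as a beacon of hope and learning for young witches and wizards").split()

vocab, rows = tfidf_rows(t)                    # each token is a document
one_vocab, one_rows = tfidf_rows([" ".join(t)])  # the sentence is one document

print(len(rows), len(vocab))
print(len(one_rows), len(one_rows[0]))
```

Because every document holds a single term, each non-empty row has exactly one non-zero weight, which L2 normalisation scales to 1.0. To weight terms across real documents, pass a list of sentences or texts rather than a list of tokens.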

In [41]:
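import pandas as pd
# v is a SciPy sparse matrix, so wrapping it directly gives one column of sparse rows;
# pd.DataFrame(v.toarray(), columns=tfidf.get_feature_names_out()) gives a dense, labelled table
pd.DataFrame(v)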

              0
0   (0, 7)\t1.0
1  (0, 14)\t1.0
2   (0, 9)\t1.0
3  (0, 19)\t1.0
4  (0, 10)\t1.0
5   (0, 4)\t1.0
6  (0, 11)\t1.0
7   (0, 5)\t1.0
8  (0, 12)\t1.0
9  (0, 10)\t1.0
