# Converting Text Data to Lowercase

I'm going to discuss the following recipes under text preprocessing and exploratory data analysis.

Recipe 1. Lowercasing
Recipe 2. Punctuation removal
Recipe 3. Stop words removal
Recipe 4. Text standardization
Recipe 5. Spelling correction
Recipe 6. Tokenization
Recipe 7. Stemming
Recipe 8. Lemmatization
Recipe 9. Exploratory data analysis
Recipe 10. End-to-end processing pipeline

***Let’s create a list of strings and assign it to a variable.***

In [1]:
text = ['I am annie', 'I read in 12',
        'I love NLP', 'I learning NLP in 2 weeks', 
        'python is the best', 'R is good langauage', 
        'I like this book','I wanna more books']

In [2]:
#convert list to dataframe
import pandas as pd
df = pd.DataFrame({'tweet':text})
print(df)

                       tweet
0                 I am annie
1               I read in 12
2                 I love NLP
3  I learning NLP in 2 weeks
4         python is the best
5        R is good langauage
6           I like this book
7         I wanna more books


In [3]:
# lowercase every word; split/join also collapses repeated whitespace
df['tweet'] = df['tweet'].apply(lambda x: " ".join(word.lower()
for word in x.split()))
df['tweet']

0                   i am annie
1                 i read in 12
2                   i love nlp
3    i learning nlp in 2 weeks
4           python is the best
5          r is good langauage
6             i like this book
7           i wanna more books
Name: tweet, dtype: object

In [4]:
text2 = ['I am annie', 'I read in 12',
        'I love NLP', 'I learning NLP in 2 weeks', 
        'python is the best', 'R is good langauage', 
        'I like this book','I wanna more books']

In [5]:
df = pd.DataFrame({'tweet2': text2})
df['tweet_lower'] = df['tweet2'].str.lower()
df

                      tweet2                tweet_lower
0                 I am annie                 i am annie
1               I read in 12               i read in 12
2                 I love NLP                 i love nlp
3  I learning NLP in 2 weeks  i learning nlp in 2 weeks
4         python is the best         python is the best
5        R is good langauage        r is good langauage
6           I like this book           i like this book
7         I wanna more books         i wanna more books


In [6]:
text2 = 'In publishing and graphic design, Lorem ipsum is a placeholder text commonly used to demonstrate'

df2= text2.lower()
df2

'in publishing and graphic design, lorem ipsum is a placeholder text commonly used to demonstrate'
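
For text that may contain non-English characters, Python also offers `str.casefold()`, a more aggressive form of lowercasing meant for caseless matching. The sketch below is not part of the recipe above; the helper name `normalize_case` and its `aggressive` flag are made up for illustration.

```python
def normalize_case(text, aggressive=False):
    """Lowercase text; with aggressive=True use casefold() for caseless matching."""
    return text.casefold() if aggressive else text.lower()

samples = ['I love NLP', 'Straße']

# lower() leaves the German sharp s untouched, casefold() expands it to "ss"
lowered = [normalize_case(s) for s in samples]
folded = [normalize_case(s, aggressive=True) for s in samples]

print(lowered)  # ['i love nlp', 'straße']
print(folded)   # ['i love nlp', 'strasse']
```

For plain English text the two methods give the same result, so `lower()` is usually enough.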

# Removing Punctuation

In this recipe😉, I'm going to discuss how to remove punctuation from
text data. For many word-level analyses, punctuation adds little extra
information, so removing it helps reduce the size of the data and the
amount of computation needed downstream.

In [7]:
text = ['I am annie.', 'I read in 12!',
        'I love NLP$', 'I learning NLP in 2 weeks*', 
        'python is the best/', 'R is good langauage@', 
        'I like this book!','I wanna more books.']

In [8]:
#convert list to dataframe
import pandas as pd
df = pd.DataFrame({'tweet':text})
print(df)

                        tweet
0                 I am annie.
1               I read in 12!
2                 I love NLP$
3  I learning NLP in 2 weeks*
4         python is the best/
5        R is good langauage@
6           I like this book!
7         I wanna more books.


In [9]:
# regex=True is required on recent pandas versions, where str.replace treats the pattern literally by default
df['tweet'] = df['tweet'].str.replace(r'[^\w\s]', '', regex=True)
df['tweet']

0                   I am annie
1                 I read in 12
2                   I love NLP
3    I learning NLP in 2 weeks
4           python is the best
5          R is good langauage
6             I like this book
7           I wanna more books
Name: tweet, dtype: object
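
As an alternative to the regular expression, punctuation can also be removed with the standard library alone. This is a minimal sketch, assuming only ASCII punctuation needs to go; the names `PUNCT_TABLE` and `remove_punctuation` are hypothetical.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punctuation(text):
    """Return text with all characters from string.punctuation removed."""
    return text.translate(PUNCT_TABLE)

tweets = ['I am annie.', 'I read in 12!', 'I love NLP$', 'python is the best/']
cleaned = [remove_punctuation(t) for t in tweets]
print(cleaned)
# ['I am annie', 'I read in 12', 'I love NLP', 'python is the best']
```

The two approaches are not identical: `string.punctuation` covers ASCII symbols only (and includes the underscore), while the pattern `[^\w\s]` also removes non-ASCII symbols such as curly quotes but keeps underscores.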

# Removing Stop Words




Stop words are very common words that carry little or no meaning compared
to other keywords. If we remove these commonly used words,
we can focus on the important keywords instead.

In [10]:
# Import stopwords with nltk (the corpus must be downloaded once, e.g. nltk.download('stopwords')).
from nltk.corpus import stopwords
stop = stopwords.words('english')

In [11]:
text = ['This is introduction to NLP', 'It is likely to be useful, to people',
        'Machine learning is the new electrcity','There would be less hype around AI and more action going forward',
        'python is the best tool', 'R is the good language','I like this Book',
        'I want more books like this.']

In [12]:
#convert list to dataframe
import pandas as pd
df = pd.DataFrame({'tweet':text})
df

                                               tweet
0                        This is introduction to NLP
1               It is likely to be useful, to people
2             Machine learning is the new electrcity
3  There would be less hype around AI and more ac...
4                            python is the best tool
5                             R is the good language
6                                   I like this Book
7                       I want more books like this.


In [13]:

# Note: the membership test is case-sensitive, so capitalized words such as "This" are kept.
df['tweet_without_stopwords'] = df['tweet'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))
print(df) 

                                               tweet  \
0                        This is introduction to NLP   
1               It is likely to be useful, to people   
2             Machine learning is the new electrcity   
3  There would be less hype around AI and more ac...   
4                            python is the best tool   
5                             R is the good language   
6                                   I like this Book   
7                       I want more books like this.   

                             tweet_without_stopwords  
0                              This introduction NLP  
1                           It likely useful, people  
2                    Machine learning new electrcity  
3  There would less hype around AI action going f...  
4                                   python best tool  
5                                    R good language  
6                                        I like Book  
7                            I want books like this.  
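
In the output above, capitalized words such as "This", "It", and "I" survive because the comparison is case-sensitive. Here is a minimal sketch of a case-insensitive filter. To keep it self-contained it uses a tiny hand-written stop-word set, which is an assumption for illustration and much shorter than NLTK's English list; `STOP_WORDS` and `remove_stop_words` are hypothetical names.

```python
# Tiny illustrative stop-word set (an assumption; NLTK's English list is much longer).
STOP_WORDS = {'this', 'is', 'to', 'it', 'be', 'the', 'i', 'there', 'and'}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop stop words, comparing in lowercase but keeping the original casing."""
    return ' '.join(w for w in text.split() if w.lower() not in stop_words)

first = remove_stop_words('This is introduction to NLP')
second = remove_stop_words('It is likely to be useful, to people')
print(first)   # introduction NLP
print(second)  # likely useful, people
```

Whether to lowercase before filtering depends on the task; lowercasing the whole column first, as in the earlier recipe, has the same effect.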


# Standardizing Text

In this recipe, we are going to discuss how to standardize text. But before
that, let’s understand what text standardization is and why we need to do it.
Most text data comes in the form of customer reviews, blogs, or tweets,
where there is a high chance of people using short words and abbreviations to
represent the same meaning. Mapping these short forms to their full forms may
help the downstream process to easily understand and resolve the semantics of the text.

In [14]:
lookup_dict = {'nlp':'Natural language processing',
'ur':'your', 'wbu': 'what about you'}

In [15]:
import re

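def text_std(input_text):
    words = input_text.split()
    new_words = []
    for word in words:
        # strip punctuation so that "nlp," still matches the lookup key
        word = re.sub(r'[^\w\s]','',word)
        # replace known abbreviations; every other word is kept unchanged
        if word.lower() in lookup_dict:
            word = lookup_dict[word.lower()]
        new_words.append(word)
    new_text = " ".join(new_words)
    return new_text

In [16]:
text_std('I like wbu its nlp choice')

'I like what about you its Natural language processing choice'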
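
As an alternative to splitting on whitespace, the same lookup can be applied with a single regular expression, which leaves the surrounding punctuation in place. This is only a sketch under that assumption; `ABBREVIATION_PATTERN` and `expand_abbreviations` are hypothetical names, and the dictionary repeats the one from the recipe.

```python
import re

lookup_dict = {'nlp': 'Natural language processing',
               'ur': 'your', 'wbu': 'what about you'}

# Match any lookup key as a whole word, ignoring case.
ABBREVIATION_PATTERN = re.compile(
    r'\b(' + '|'.join(map(re.escape, lookup_dict)) + r')\b',
    flags=re.IGNORECASE,
)

def expand_abbreviations(text):
    """Replace each known abbreviation with its full form."""
    return ABBREVIATION_PATTERN.sub(lambda m: lookup_dict[m.group(0).lower()], text)

expanded = expand_abbreviations('I like NLP, wbu?')
print(expanded)  # I like Natural language processing, what about you?
```

The word boundaries (`\b`) keep the pattern from firing inside longer words, so "your" is not touched by the "ur" entry.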

# Correcting Spelling

Spelling correction helps us reduce multiple copies of words
that represent the same meaning. For example, “proccessing” and
“processing” will be treated as different words even though they are used in the
same sense.
Note that abbreviations should be handled before this step, or else
the corrector would fail at times. Say, for example, “ur” (which actually means
“your”) would be corrected to “or.”

In [17]:
txt = ['This will help us in reducing multiple copies of words',
       'which represents the same meaning','For example, “proccessing” and “processing”',
       'will be treated as different words even if they are used in the same sense.',
       'Note that abbreviations should be handled before this step, or else',
        'the corrector would fail at times.',
       'Say, for example, “ur” (actually means“your”) would be corrected to “or.”']

In [18]:
#convert list to DataFrame
import pandas as pd
df = pd.DataFrame({'tweet': txt})
df

                                               tweet
0  This will help us in reducing multiple copies ...
1                  which represents the same meaning
2        For example, “proccessing” and “processing”
3  will be treated as different words even if the...
4  Note that abbreviations should be handled befo...
5                 the corrector would fail at times.
6  Say, for example, “ur” (actually means“your”) ...


In [19]:
from textblob import TextBlob
df['tweet']= df['tweet'].apply(lambda x: str(TextBlob(x).correct()))
df['tweet']

0    His will help us in reducing multiple copies o...
1                    which represents the same meaning
2           For example, “processing” and “processing”
3    will be treated as different words even if the...
4    Note that abbreviations should be handled befo...
5                     the correct would fail at times.
6    May, for example, “or” (actually means“your”) ...
Name: tweet, dtype: object
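
To get a feel for how a corrector can map a misspelled word to a known one, here is a rough sketch that uses `difflib` from the standard library instead of TextBlob. This is a swapped-in technique, not TextBlob's algorithm; the vocabulary, the cutoff value, and the name `correct_word` are assumptions for illustration.

```python
import difflib

# Hypothetical vocabulary of known-good words; a real one would be far larger.
VOCABULARY = ['processing', 'meaning', 'words', 'different', 'your']

def correct_word(word, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary word, or the word unchanged if none is close enough."""
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

fixed = correct_word('proccessing')
unchanged = correct_word('ur')
print(fixed)      # processing
print(unchanged)  # ur
```

A similarity-based corrector has no way of knowing that "ur" stands for "your", which is one more reason to expand abbreviations before correcting spelling.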

# Tokenizing Text

Tokenization refers to splitting text into minimal meaningful units. There are sentence tokenizers
and word tokenizers. We will see a word tokenizer in this recipe; word tokenization is
a standard step in text preprocessing for most kinds of analysis.

In [20]:
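import nltk
# word_tokenize needs NLTK's Punkt data, e.g. nltk.download('punkt') ('punkt_tab' on newer NLTK releases)
nltk.word_tokenize(df["tweet"][0])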

['His',
 'will',
 'help',
 'us',
 'in',
 'reducing',
 'multiple',
 'copies',
 'of',
 'words']

In [21]:
import pandas as pd
import nltk

#convert list to DataFrame
df = pd.DataFrame({'tweet': txt})
df['tokenized_sents'] = df['tweet'].apply(nltk.word_tokenize)
df

                                               tweet  \
0  This will help us in reducing multiple copies ...
1                  which represents the same meaning
2        For example, “proccessing” and “processing”
3  will be treated as different words even if the...
4  Note that abbreviations should be handled befo...
5                 the corrector would fail at times.
6  Say, for example, “ur” (actually means“your”) ...

                                     tokenized_sents
0  [This, will, help, us, in, reducing, multiple,...
1            [which, represents, the, same, meaning]
2  [For, example, ,, “, proccessing, ”, and, “, p...
3  [will, be, treated, as, different, words, even...
4  [Note, that, abbreviations, should, be, handle...
5        [the, corrector, would, fail, at, times, .]
6  [Say, ,, for, example, ,, “, ur, ”, (, actuall...
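
To see what a word tokenizer does under the hood, here is a rough regex-based approximation that needs no NLTK data. It is not NLTK's algorithm, just a sketch that treats runs of word characters and single punctuation marks as tokens; `TOKEN_PATTERN` and `simple_word_tokenize` are hypothetical names.

```python
import re

# A token is either a run of word characters or one punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def simple_word_tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

tokens = simple_word_tokenize('the corrector would fail at times.')
print(tokens)
# ['the', 'corrector', 'would', 'fail', 'at', 'times', '.']
```

This simple pattern splits a contraction such as "don't" into three pieces ('don', "'", 't'); NLTK's tokenizer has more elaborate rules for such cases.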


# Stemming

In this recipe, we will discuss stemming. Stemming is the process of
extracting a root word. For example, “fish,” “fishes,” and “fishing” are all stemmed to “fish.”

In [22]:
text=['I like fishing','I am eating fish','There are many fishes in pound']
#convert list to dataframe
import pandas as pd
df = pd.DataFrame({'tweet':text})
print(df)

                            tweet
0                  I like fishing
1                I am eating fish
2  There are many fishes in pound


In [23]:
#Import library
from nltk.stem import PorterStemmer
st = PorterStemmer()
df['tweet_stemming'] = df['tweet'].apply(lambda x: " ".join([st.stem(word) for
word in x.split()]))
df

                            tweet                tweet_stemming
0                  I like fishing                   I like fish
1                I am eating fish                 I am eat fish
2  There are many fishes in pound  there are mani fish in pound
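
The idea behind stemming can be illustrated with a deliberately naive suffix stripper. This is not the Porter algorithm, which applies many more rules; the suffix list, the minimum stem length, and the name `naive_stem` are assumptions for illustration only.

```python
def naive_stem(word, suffixes=('ing', 'es', 's'), min_stem_len=3):
    """Strip the first matching suffix, as long as at least min_stem_len characters remain."""
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word

stems = [naive_stem(w) for w in ['fish', 'fishes', 'fishing', 'leaves']]
print(stems)  # ['fish', 'fish', 'fish', 'leav']
```

Note that "leaves" becomes "leav", which is not a real word. Rule-based stemmers simply chop suffixes off, as the "mani" in the output above also shows.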


# Lemmatizing

Lemmatization is the process of
extracting a root word by considering the vocabulary. For example, “good,”
“better,” and “best” are lemmatized to “good.”
Lemmatization takes the part of speech of a word into account. It
returns the dictionary form of a word, which must be a valid word, while
stemming just extracts the root, which need not be a valid word.

In [24]:
text=['I like fishing','I eating fish','There are many fishes in pound', 'leaves and leaf']
#convert list to dataframe
import pandas as pd
df = pd.DataFrame({'tweet':text})
print(df)

                            tweet
0                  I like fishing
1                   I eating fish
2  There are many fishes in pound
3                 leaves and leaf


In [25]:
#Import library
from textblob import Word
#Code for lemmatize
df['tweet'] = df['tweet'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
df['tweet']

0                  I like fishing
1                   I eating fish
2    There are many fish in pound
3                   leaf and leaf
Name: tweet, dtype: object
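
In the output above, "fishing" and "eating" are left unchanged, which is consistent with the words being treated as nouns when no part of speech is given. TextBlob's `Word.lemmatize` accepts an optional part-of-speech argument, for example `Word('fishing').lemmatize('v')`. The toy sketch below shows why the part of speech matters, using a hand-written lookup table instead of a real lexicon; the table `LEMMAS` and the function `lemmatize` are hypothetical and only cover a few words.

```python
# Hypothetical lemma dictionary keyed by (word, part of speech).
# 'n' = noun, 'v' = verb, 'a' = adjective; a real lemmatizer uses a full lexicon.
LEMMAS = {
    ('leaves', 'n'): 'leaf',
    ('leaves', 'v'): 'leave',
    ('fishes', 'n'): 'fish',
    ('fishing', 'v'): 'fish',
    ('eating', 'v'): 'eat',
    ('better', 'a'): 'good',
    ('best', 'a'): 'good',
}

def lemmatize(word, pos='n'):
    """Look up the dictionary form; fall back to the word itself if the pair is unknown."""
    return LEMMAS.get((word.lower(), pos), word)

print(lemmatize('leaves'))        # leaf
print(lemmatize('leaves', 'v'))   # leave
print(lemmatize('fishing'))       # fishing
print(lemmatize('fishing', 'v'))  # fish
```

The same surface form can have different lemmas depending on its role in the sentence, so a lemmatizer gives better results when it is told, or can infer, the part of speech.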