In [44]:
import pandas as pd
import numpy as np
import string
import re
import nltk

Text Preprocessing 
In this kernel, we will talk about the basic steps of text preprocessing.

These steps convert text from human language into a machine-readable format for further processing. We will also discuss text preprocessing tools. Once a text is obtained, we start with text normalization, which includes:

- converting all letters to lower or upper case

- converting numbers into words or removing numbers

- removing punctuation, accent marks, and other diacritics

- removing extra whitespace

- expanding abbreviations

- removing stop words, sparse terms, and particular words

- applying lemmatization
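
A few of the steps above can be sketched with the standard library alone. This is a minimal illustration, not the method used later in this kernel: the function name `normalize` and the sample sentence are made up for the example, and it covers only lower casing, accent removal, number and punctuation removal, and whitespace cleanup.

```python
import re
import string
import unicodedata


def normalize(text: str) -> str:
    """Apply a few of the normalization steps listed above to one string."""
    # Lower-case all letters.
    text = text.lower()
    # Strip accent marks: decompose characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace digits and ASCII punctuation with spaces.
    text = re.sub(f"[{re.escape(string.punctuation)}0-9]", " ", text)
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


print(normalize("  Café  prices rose 20%, didn't they?! "))
# cafe prices rose didn t they
```

Replacing punctuation with a space rather than deleting it keeps neighbouring words from being glued together, at the cost of splitting contractions such as "didn't" into two tokens.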

In [45]:
df_train = pd.read_csv("https://raw.githubusercontent.com/Nikhil-V98/Hackathons/main/Machinehack/Sentiment_Analysis/train.csv")
df_test = pd.read_csv("https://raw.githubusercontent.com/Nikhil-V98/Hackathons/main/Machinehack/Sentiment_Analysis/test.csv")

In [46]:
df_train.head(5)

Unnamed: 0,ID,author,Review,Sentiment
0,39467,rayinstirling,Today I'm working on my &quot;Quirky Q&quot; c...,2
1,30154,DirtyRose17,@ShannonElizab dont ya know? people love the h...,1
2,16767,yoliemichelle,ughhh rejected from the 09 mediation program. ...,0
3,9334,jayamelwani,@petewentz im so jealous. i want an octo drive,0
4,61178,aliisanoun,I remember all the hype around this movie when...,0


In [47]:
y = df_train['Sentiment']
y

0        2
1        1
2        0
3        0
4        0
        ..
44095    2
44096    2
44097    2
44098    2
44099    0
Name: Sentiment, Length: 44100, dtype: int64

In [48]:
train = df_train.copy()
test = df_test.copy()

train.drop(['ID', 'author'], axis=1, inplace=True)
test.drop(['ID', 'author'], axis=1, inplace=True)

In [49]:
train.head(5)

Unnamed: 0,Review,Sentiment
0,Today I'm working on my &quot;Quirky Q&quot; c...,2
1,@ShannonElizab dont ya know? people love the h...,1
2,ughhh rejected from the 09 mediation program. ...,0
3,@petewentz im so jealous. i want an octo drive,0
4,I remember all the hype around this movie when...,0


### Lower casing: Converting a word to lower case (NLP -> nlp).

In [50]:
text = train.copy()
# Vectorized lower casing; this avoids the chained assignment
# text['Review'][i] = ..., which triggers pandas' SettingWithCopyWarning.
text['Review'] = text['Review'].str.lower()


### Removing numbers and punctuation

In [51]:
# Replace every character that is not an ASCII letter with a space.
text['Review'] = text['Review'].str.replace("[^a-zA-Z]", " ", regex=True)


In [53]:
text.head()

Unnamed: 0,Review,Sentiment
0,today i m working on my quot quirky q quot c...,2
1,shannonelizab dont ya know people love the h...,1
2,ughhh rejected from the mediation program ...,0
3,petewentz im so jealous i want an octo drive,0
4,i remember all the hype around this movie when...,0


### Tokenization and removing default stop words:
- Tokenization: Splitting the sentence into words.
- Stop word removal: Dropping very commonly used words (a, an, the, etc.) that carry little meaning on their own.

In [57]:
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')

# Build the stop word set once instead of on every iteration.
stop_words = set(stopwords.words("english"))

# Tokenize each review and keep only the tokens that are not stop words.
text['Review'] = text['Review'].apply(
    lambda review: [word for word in word_tokenize(review) if word not in stop_words]
)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Stemming and Lemmatization:
- Stemming: The process of reducing a word to its root form. The root, or stem, is the part to which inflectional affixes (such as -ed, -ing, -s) are added; it is not always a valid word.
- Lemmatization: Like stemming, lemmatization reduces a word to a base form, but it ensures that the result (the lemma) is a valid word in the language.
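
The difference can be seen on a few tokens. The sketch below uses NLTK's `PorterStemmer` and `WordNetLemmatizer`; the helper names `stem_tokens` and `lemmatize_tokens` and the sample tokens are made up for this example, and the choice of the Porter algorithm and of the verb part-of-speech tag is an assumption, not something this kernel prescribes. The lemmatizer needs the WordNet corpus, which is assumed to have been fetched once with `nltk.download('wordnet')`.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()


def stem_tokens(tokens):
    """Reduce each token to its Porter stem (may not be a dictionary word)."""
    return [stemmer.stem(tok) for tok in tokens]


def lemmatize_tokens(tokens, pos="v"):
    """Map each token to its WordNet lemma, treating it as part of speech `pos`."""
    return [lemmatizer.lemmatize(tok, pos=pos) for tok in tokens]


tokens = ["running", "studies", "rejected"]
print(stem_tokens(tokens))  # ['run', 'studi', 'reject']

try:
    print(lemmatize_tokens(tokens))
except LookupError:
    # The WordNet data is not installed; run nltk.download('wordnet') once first.
    print("WordNet corpus not found")
```

The stem `studi` is not an English word, which is exactly the case lemmatization is meant to avoid. With WordNet available, the lemmatizer should return the dictionary form instead.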

### Example: using a function on a DataFrame

In [None]:
import re
from datetime import datetime

import pandas as pd
from nltk.corpus import stopwords

# Build the stop word set once; rebuilding it for every tweet is slow.
stopword_set = set(stopwords.words("english"))

def preprocess(raw_text):

    # keep only letters
    letters_only_text = re.sub("[^a-zA-Z]", " ", raw_text)

    # convert to lower case and split into words
    words = letters_only_text.lower().split()

    # remove stopwords
    meaningful_words = [w for w in words if w not in stopword_set]

    # join the cleaned words back into a single string
    cleaned_text = " ".join(meaningful_words)

    return cleaned_text

def process_data(dataset):
    # expects a '|'-delimited file without a header row; the tweet text is in column 1
    tweets_df = pd.read_csv(dataset, delimiter='|', header=None)

    num_tweets = tweets_df.shape[0]
    print("Total tweets: " + str(num_tweets))

    cleaned_tweets = []
    print("Beginning processing of tweets at: " + str(datetime.now()))

    for i in range(num_tweets):
        # positional access to row i, column 1
        cleaned_tweet = preprocess(tweets_df.iloc[i, 1])
        cleaned_tweets.append(cleaned_tweet)
        if i % 10000 == 0:
            print(str(i) + " tweets processed")

    print("Finished processing of tweets at: " + str(datetime.now()))
    return cleaned_tweets

cleaned_data = process_data("tweets.csv")