# A Gentle Introduction to NLP
---------------
## What is Natural Language Processing (NLP)?
```
NLP is a branch of AI that deals with analyzing, understanding, and generating the languages humans use naturally, so that people can interact with computers in written and spoken form using human languages instead of computer languages.
```

# Applications of NLP

* Machine translation
* Natural language generation
* Web search
* Spam filters
* Sentiment analysis
* Chatbots
* ...

-----------------------------------------------------------------------------
# Pipeline for a simple binary classification on a Twitter dataset
----------------------
### 1. Loading Dataset
### 2. Dataset Cleaning
### 3. Data Vectorization
### 4. Train a Model
### 5. Evaluation
### 6. Checking for overfitting and underfitting using cross-validation and the bias-variance tradeoff

----------------------------------------------------------------------------------------------------------
------

# 1. Loading Dataset

In [1]:
#pandas library for working with dataframes 
import pandas as pd

#loading dataset
train = pd.read_csv("train_df.csv")
test = pd.read_csv("test_df.csv")
# shape of dataset
print("Size of train-set is : {}".format(train.shape[0]))
print("Size of test-set is : {}".format(test.shape[0]))

#show a few lines of dataframe
train.head()

Size of train-set is : 700
Size of test-set is : 300


| Unnamed: 0 | labels | tweets |
|---|---|---|
| 0 | pos | Now all @Apple has to do is get swype on the i... |
| 1 | pos | @Apple will be adding more carrier support to ... |
| 2 | pos | Hilarious @youtube video - guy does a duet wit... |
| 3 | pos | @RIM you made it too easy for me to switch to ... |
| 4 | pos | I just realized that the reason I got into twi... |
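
Since this is a binary classification task, it can help to check how balanced the two classes are before going further. The sketch below uses only the standard library; `label_distribution` and the toy labels are illustrative assumptions and are not part of the original notebook.

```python
from collections import Counter

def label_distribution(labels):
    """Return the fraction of examples per label (hypothetical helper)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Toy labels for illustration only; they are not taken from the real dataset.
toy_labels = ["pos", "pos", "neg", "pos"]
print(label_distribution(toy_labels))  # {'pos': 0.75, 'neg': 0.25}
```

Because the helper accepts any iterable of labels, passing the `labels` column of the DataFrame loaded above (for example `train['labels']`) should work as well.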


In [2]:
# print a few tweets to see what they look like
sample_no = 10
samples = train.sample(sample_no)
tweets = samples['tweets'].tolist()
labels = samples['labels'].tolist()
for tweet, label in zip(tweets, labels):
    print("[tweet]>>> {}\n[label]>>>{}".format(tweet, label))
    print("====================")

[tweet]>>> RT @YashasJoshi: #android 4.0!!  Ice Cream Sandwich will now make all other smartphone operating systems look outdated.  #Androidbeam Wo ...
[label]>>>pos
[tweet]>>> Had ma Ipas not 24 hours an I jailbroke it...now its SHAGGED itunes won't letme restore it @apple SYM!!!!
[label]>>>neg
[tweet]>>> Interesting... @apple now requires you to have a reservation ?         #apple #iphone #4S    :  http://t.co/zZK4fTii
[label]>>>neg
[tweet]>>> Been on hold with @apple customer service for 25 minutes. Wow, lt's like they're #timewarnercable.
[label]>>>neg
[tweet]>>> RT @JDougyJ: hey @apple do you wanna fix my fucking playlist order, you piece of shit
[label]>>>neg
[tweet]>>> The new Nexus looks good! UI feel a bit confusing but I guess Android users are ok with that! Now, try it :) #Google #android #nexusprime
[label]>>>pos
[tweet]>>> @Apple #iOS5 gm on ipad1 is very slow wash better on beta7/8 solve this problem, or give me the ipad2
[label]>>>neg
[tweet]>>> Just apply for a job at @

# 2. Cleaning text

### What is the goal of text cleaning?
* Case normalization: whether to lowercase (or uppercase) the text depends on your feature extractor. Most of the time, except for character-level n-gram feature extraction, you don't need to convert everything to a single case.
* Stop word removal: stop words are words that appear very frequently and don't convey any contextual meaning or pattern that helps classification.
* Punctuation marks and single characters such as ‘d’, ‘r’, ‘a’, ‘e’ don’t hold any context individually.
* In a Twitter dataset: do hashtags (#), mentions (@), URLs, and numbers carry any meaning?
* and so on ...
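
One way to approach the question about hashtags, mentions, URLs, and numbers is to look at them before deciding to drop them. The following is a minimal sketch with simplified regular expressions; the `PATTERNS` table and the `extract_twitter_items` helper are hypothetical names introduced here, and real tweets may need more careful patterns.

```python
import re

# Simplified patterns, chosen for illustration only.
PATTERNS = {
    "hashtags": re.compile(r"#\w+"),
    "mentions": re.compile(r"@\w+"),
    "urls": re.compile(r"https?://\S+"),
    "numbers": re.compile(r"\b\d+(?:\.\d+)?\b"),
}

def extract_twitter_items(tweet):
    """Collect the Twitter-specific items found in one tweet."""
    return {name: pattern.findall(tweet) for name, pattern in PATTERNS.items()}

example = ("Been on hold with @apple customer service for 25 minutes. "
           "Wow, lt's like they're #timewarnercable.")
for name, items in extract_twitter_items(example).items():
    print(name, items)
```

Here the hashtag `#timewarnercable` arguably carries most of the sentiment, which suggests that removing hashtags blindly can discard useful signal.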

In [3]:
tweets

['RT @YashasJoshi: #android 4.0!!  Ice Cream Sandwich will now make all other smartphone operating systems look outdated.  #Androidbeam Wo ...',
 "Had ma Ipas not 24 hours an I jailbroke it...now its SHAGGED itunes won't letme restore it @apple SYM!!!!",
 'Interesting... @apple now requires you to have a reservation ?         #apple #iphone #4S    :  http://t.co/zZK4fTii',
 "Been on hold with @apple customer service for 25 minutes. Wow, lt's like they're #timewarnercable.",
 'RT @JDougyJ: hey @apple do you wanna fix my fucking playlist order, you piece of shit',
 'The new Nexus looks good! UI feel a bit confusing but I guess Android users are ok with that! Now, try it :) #Google #android #nexusprime',
 '@Apple #iOS5 gm on ipad1 is very slow wash better on beta7/8 solve this problem, or give me the ipad2',
 'Just apply for a job at @Apple, hope they call me lol',
 'Smashed yet another #iPhone screen. When are @apple gonna make the phone more durable!!! Tired of fixing it myself!!!',
 'N

**Lowercased**
![](images/lowercase.png)

In [4]:
sample1 = tweets[0]
# lowercase the text
sample1 = sample1.lower()
print(sample1)

rt @yashasjoshi: #android 4.0!!  ice cream sandwich will now make all other smartphone operating systems look outdated.  #androidbeam wo ...


In [5]:
# remove Twitter-specific items from the tweet:
# URLs - Hashtags - Mentions - Reserved words (RT, FAV) - Emojis - Smileys
# (the preprocessor module is assumed to come from the tweet-preprocessor package)
import preprocessor as p
sample1 = p.clean(sample1)
print(sample1)

rt : !! ice cream sandwich will now make all other smartphone operating systems look outdated. wo ...
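
To make this cleaning step less of a black box, here is a rough standard-library approximation. It is not the library's implementation: the patterns and the `simple_clean` name are assumptions made for illustration. Note that `rt` survives in the output above, plausibly because the text was lowercased before cleaning; this sketch matches reserved words case-insensitively, which is a design choice of the sketch.

```python
import re

# Simplified patterns; each match is replaced by a space.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
RESERVED_RE = re.compile(r"\b(?:rt|fav)\b", re.IGNORECASE)
# Standalone numbers only, so that tokens like "ipad1" are left alone.
NUMBER_RE = re.compile(r"(?<![\w.])\d+(?:\.\d+)?(?!\w)")

def simple_clean(tweet):
    """Remove URLs, mentions, hashtags, reserved words, and standalone numbers."""
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE, RESERVED_RE, NUMBER_RE):
        tweet = pattern.sub(" ", tweet)
    # collapse the runs of whitespace left behind by the substitutions
    return " ".join(tweet.split())

print(simple_clean("RT @YashasJoshi: #android 4.0!!  Ice Cream Sandwich"))
# : !! Ice Cream Sandwich
```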


**Tokenize text to remove stop words**

In [6]:
#a word tokenizer
import nltk
tokens = nltk.word_tokenize(sample1)
print(tokens)

['rt', ':', '!', '!', 'ice', 'cream', 'sandwich', 'will', 'now', 'make', 'all', 'other', 'smartphone', 'operating', 'systems', 'look', 'outdated', '.', 'wo', '...']
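
To see roughly what a word tokenizer does, the sketch below splits text into words, ellipses, and single punctuation marks with one regular expression. It is a simplified stand-in, not NLTK's algorithm, and `simple_tokenize` is a hypothetical name. One visible difference: this sketch keeps a contraction such as `won't` as one token, whereas NLTK splits it into `wo` and `n't`.

```python
import re

# A word (optionally with an apostrophe part), an ellipsis, or one punctuation mark.
TOKEN_RE = re.compile(r"\w+(?:'\w+)?|\.{3}|[^\w\s]")

def simple_tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)

print(simple_tokenize("rt : !! ice cream sandwich look outdated. wo ..."))
# ['rt', ':', '!', '!', 'ice', 'cream', 'sandwich', 'look', 'outdated', '.', 'wo', '...']
```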


In [7]:
#stopword list 
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
print("The size of stopwords list:", len(stop_words))
print("English StopWords::\n",stop_words)

The size of stopwords list: 179
English StopWords::
 {'has', 'further', 'up', 'will', 'hasn', 'wouldn', 'shan', 'because', 'its', 'against', 'the', 'or', "won't", 'there', "wouldn't", 'ours', 'myself', 'hers', 'nor', "she's", 'in', 'am', 'couldn', 'i', 'all', "should've", 've', 'needn', 'weren', "shan't", 'ma', 'such', "couldn't", 'm', 'ain', 'while', 'under', 'mustn', 'your', 'me', 'that', 'only', 'did', 'was', 'does', 'over', 'is', 'll', 'to', 't', 'hadn', 'you', 'not', 'other', 'should', 'as', 'out', "haven't", 'being', 'those', 're', 'won', "isn't", "needn't", 'yours', 'doesn', 'ourselves', 'do', 'but', 'have', 'into', 'having', 'for', 'most', 'with', 'at', 'off', 'him', 'until', "shouldn't", 'been', 'by', 'theirs', 'after', "hasn't", 'o', 'too', 'which', 'had', 'when', 'and', 'below', 'it', 'our', 'an', 'didn', 'down', 'who', 'few', 'now', 'above', 'of', 'd', "you're", 'are', "don't", "hadn't", 'here', 'her', 'some', 'my', "you've", "that'll", 'than', "mustn't", 'during', 'very', 

**Now remove stop words from the tokenized text, all in one**

In [8]:
import nltk
from nltk.corpus import stopwords

print("SAMPLE:  ",sample1)
tokens = nltk.word_tokenize(sample1)

stop_words = set(stopwords.words('english'))

new_tokens = []
for token in tokens:
    if token not in stop_words:
        new_tokens.append(token)
sample1 = " ".join(new_tokens)
print("CLEANED:   "," ".join(new_tokens))

SAMPLE:   rt : !! ice cream sandwich will now make all other smartphone operating systems look outdated. wo ...
CLEANED:    rt : ! ! ice cream sandwich make smartphone operating systems look outdated . wo ...
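
The stop word check is a plain set lookup, so it is case-sensitive, and the NLTK list is all lowercase. That is one reason lowercasing first matters in this pipeline. A minimal sketch with a tiny illustrative stop word set (the notebook itself uses NLTK's full English list; `TOY_STOP_WORDS` and `remove_stop_words` are hypothetical names):

```python
# Four words that also appear in NLTK's English stop word list.
TOY_STOP_WORDS = {"will", "now", "all", "other"}

def remove_stop_words(tokens, stop_words=TOY_STOP_WORDS, lowercase=True):
    """Keep the tokens that are not stop words, optionally ignoring case."""
    kept = []
    for token in tokens:
        key = token.lower() if lowercase else token
        if key not in stop_words:
            kept.append(token)
    return kept

tokens = ["Ice", "Cream", "Sandwich", "Will", "Now", "make", "all", "other"]
print(remove_stop_words(tokens, lowercase=False))
# ['Ice', 'Cream', 'Sandwich', 'Will', 'Now', 'make']
print(remove_stop_words(tokens))
# ['Ice', 'Cream', 'Sandwich', 'make']
```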


**Punctuation Marks Removal**

In [9]:
# remove punctuation from each token
import string
print("SAMPLE::   ", sample1)
# translation table that deletes every punctuation character
table = str.maketrans('', '', string.punctuation)

tokens = nltk.word_tokenize(sample1)

stripped = [token.translate(table) for token in tokens]
stripped_text = ' '.join(stripped)
print("STRIPPED::", stripped_text)

SAMPLE::    rt : ! ! ice cream sandwich make smartphone operating systems look outdated . wo ...
STRIPPED:: rt    ice cream sandwich make smartphone operating systems look outdated  wo 
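
The repeated spaces in the output above come from tokens that consisted only of punctuation: after translation they become empty strings, and joining them still inserts a space. A small variation, sketched here with a hypothetical `strip_punctuation` helper, drops those empty tokens:

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(tokens):
    """Strip punctuation from each token and drop tokens that end up empty."""
    stripped = (token.translate(PUNCT_TABLE) for token in tokens)
    return [token for token in stripped if token]

tokens = ["rt", ":", "!", "!", "ice", "cream", "outdated", ".", "wo", "..."]
print(" ".join(strip_punctuation(tokens)))  # rt ice cream outdated wo
```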


**Stemming**

In [16]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

tokens = nltk.word_tokenize(sample1)
print("tokens before stemming::", tokens)

stem = [stemmer.stem(token) for token in tokens]
print("tokens after stemming::", stem)

tokens before stemming:: ['rt', ':', '!', '!', 'ice', 'cream', 'sandwich', 'make', 'smartphone', 'operating', 'systems', 'look', 'outdated', '.', 'wo', '...']
tokens after stemming:: ['rt', ':', '!', '!', 'ice', 'cream', 'sandwich', 'make', 'smartphon', 'oper', 'system', 'look', 'outdat', '.', 'wo', '...']
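
To build intuition for what a stemmer does, here is a deliberately naive suffix stripper. It is a toy, not the Porter algorithm: Porter applies many more rules in several phases, which is why it produces `oper` for `operating` above while this toy yields `operat`. The `SUFFIXES` tuple and `toy_stem` are hypothetical names.

```python
# Suffixes are tried in order; only the first match is removed.
SUFFIXES = ("ing", "ed", "s")

def toy_stem(token):
    """Strip one common suffix, keeping at least three characters of the word."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

print([toy_stem(t) for t in ["operating", "systems", "outdated", "look"]])
# ['operat', 'system', 'outdat', 'look']
```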


## Let's put it all together to build a function for preprocessing a tweet

In [17]:
import preprocessor as p
from nltk.corpus import stopwords
import nltk
import string
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))
table = str.maketrans('', '', string.punctuation)

def preprocessing(tweet):
    # lowercase the tweet
    tweet = tweet.lower()

    # remove URLs, hashtags, mentions, and other Twitter-specific items
    tweet = p.clean(tweet)

    # split the tweet into word tokens
    tokens = nltk.word_tokenize(tweet)

    # strip punctuation characters from each token
    tokens = [token.translate(table) for token in tokens]

    # remove stop words
    tokens = [token for token in tokens if token not in stop_words]

    # reduce each token to its stem
    tokens = [stemmer.stem(token) for token in tokens]

    return ' '.join(tokens)

In [18]:
tweets

['RT @YashasJoshi: #android 4.0!!  Ice Cream Sandwich will now make all other smartphone operating systems look outdated.  #Androidbeam Wo ...',
 "Had ma Ipas not 24 hours an I jailbroke it...now its SHAGGED itunes won't letme restore it @apple SYM!!!!",
 'Interesting... @apple now requires you to have a reservation ?         #apple #iphone #4S    :  http://t.co/zZK4fTii',
 "Been on hold with @apple customer service for 25 minutes. Wow, lt's like they're #timewarnercable.",
 'RT @JDougyJ: hey @apple do you wanna fix my fucking playlist order, you piece of shit',
 'The new Nexus looks good! UI feel a bit confusing but I guess Android users are ok with that! Now, try it :) #Google #android #nexusprime',
 '@Apple #iOS5 gm on ipad1 is very slow wash better on beta7/8 solve this problem, or give me the ipad2',
 'Just apply for a job at @Apple, hope they call me lol',
 'Smashed yet another #iPhone screen. When are @apple gonna make the phone more durable!!! Tired of fixing it myself!!!',
 'N

In [19]:
for tweet in tweets:
    cleaned = preprocessing(tweet)
    print(cleaned)

rt    ice cream sandwich make smartphon oper system look outdat  wo 
ipa hour jailbrok  shag itun wo nt letm restor sym    
interest  requir reserv  
hold custom servic minut  wow  lt like 
rt  hey wan na fix fuck playlist order  piec shit
new nexu look good  ui feel bit confus guess android user ok   tri
gm ipad1 slow wash better beta78 solv problem  give ipad2
appli job  hope call lol
smash yet anoth screen  gon na make phone durabl    tire fix   
new galaxi nexu  app improv  inbuilt panoram pictur
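
The output above contains fragments such as `wo nt`. These appear to come from the tokenizer splitting `won't` into two pieces before punctuation is stripped, so neither piece matches the stop word list. One possible remedy, which the original notebook does not use, is to expand contractions before tokenizing. The mapping below is a small hypothetical example that would need to be extended for real data, and it only handles straight apostrophes.

```python
import re

# Small illustrative mapping; keys must be lowercase.
CONTRACTIONS = {
    "won't": "will not",
    "don't": "do not",
    "they're": "they are",
}

CONTRACTION_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)

def expand_contractions(text):
    """Replace known contractions with their (lowercase) expanded form."""
    return CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_contractions("itunes won't letme restore it"))
# itunes will not letme restore it
```

After expansion, `will` and `not` are ordinary tokens that the stop word filter can remove like any others.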


# Thanks for your attention during the first part of this lecture!

