## Extract Data
From: Chapter 2 of    
Natural Language Processing Recipes: Unlocking Text Data with Machine Learning and Deep Learning using Python   
by Adarsha Shivananda, Akshay Kulkarni   
https://www.safaribooksonline.com/library/view/natural-language-processing/9781484242674/

#### Sample Data

In [1]:
text = [
    'This is introduction to NLP',
    'It is likely to be useful, to people ',
    'Machine learning is the new electrcity',
    'There would be less hype around AI and more action going forward',
    'python is the best tool!',
    'R is good langauage',
    'I like this book',
    'I want more books like this',
]
# Convert the list to a DataFrame
import pandas as pd
df = pd.DataFrame({'tweet':text})
df.head(3)

Unnamed: 0,tweet
0,This is introduction to NLP
1,"It is likely to be useful, to people"
2,Machine learning is the new electrcity


#### Applying Basic String Functions to All Entries in a DataFrame   
For example, convert every entry to lowercase so that later matching is easier.

In [2]:
# The book's version (also collapses repeated whitespace):
# df['tweet'] = df['tweet'].apply(lambda x: " ".join(x.lower() for x in x.split()))
# Simplified version (lowercases only):
df['tweet'] = df['tweet'].apply(lambda x: x.lower())
df.head(3)

Unnamed: 0,tweet
0,this is introduction to nlp
1,"it is likely to be useful, to people"
2,machine learning is the new electrcity


However, it is generally better to avoid `apply` when pandas offers a vectorized alternative, such as the `.str` accessor.   
For more on this, see Ted Petrou's recommendations:
https://github.com/tdpetrou/Learn-Pandas/blob/master/Minimal%20Pandas/Minimally%20Sufficient%20Pandas.ipynb

In [3]:
df['tweet'] = df['tweet'].str.lower()
df.head(3)

Unnamed: 0,tweet
0,this is introduction to nlp
1,"it is likely to be useful, to people"
2,machine learning is the new electrcity
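
One practical difference between the two approaches is how they treat missing values. The sketch below uses a small made-up Series (`s` is illustrative and not part of the sample data) to show that the `.str` accessor passes a missing entry through, while a plain `apply` calling `x.lower()` fails on it.

```python
import pandas as pd

# Hypothetical sample with a missing entry (not from the original data).
s = pd.Series(['Hello World', None, 'NLP'])

# The .str accessor skips missing values and leaves them missing.
lowered = s.str.lower()
print(lowered.tolist()[0], lowered.isna().tolist())

# A plain apply calls .lower() on every element, including the missing one,
# which has no .lower() method.
try:
    s.apply(lambda x: x.lower())
    apply_failed = False
except AttributeError:
    apply_failed = True
print(apply_failed)
```

The sample data here has no missing tweets, so both cells give the same result; the difference only shows up on messier input.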


#### Another Example: Remove Punctuation

In [4]:
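# regex=True is required in pandas 2.0+, where the pattern is otherwise treated as a literal string
df['tweet'] = df['tweet'].str.replace(r'[^\w\s]', '', regex=True)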
df.head(3)

Unnamed: 0,tweet
0,this is introduction to nlp
1,it is likely to be useful to people
2,machine learning is the new electrcity
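
The pattern `[^\w\s]` matches any character that is neither a word character nor whitespace. A minimal sketch with made-up strings (`samples` is illustrative only) shows what it removes and what it keeps. Note that `\w` covers digits and the underscore, so those survive.

```python
import pandas as pd

# Made-up examples (not from the original data).
samples = pd.Series(['hello, world!', 'snake_case stays', "it's 100% done"])

# A raw string keeps the backslashes intact; regex=True tells pandas to
# interpret the pattern as a regular expression rather than a literal.
cleaned = samples.str.replace(r'[^\w\s]', '', regex=True)
print(cleaned.tolist())
```

If apostrophes or other characters should be preserved, they would need to be added to the character class, for example `[^\w\s']`.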


#### Remove Stop Words
Use the NLTK library, or build your own stop words file.

In [6]:
!pip install nltk
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stopwords_english = stopwords.words('english')



[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Oozturk\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [7]:
# Remove stop words: keep only the tokens that are not in the NLTK list
df['tweet'] = df['tweet'].apply(lambda x: " ".join(word for word in x.split() if word not in stopwords_english))
df.head(3)

Unnamed: 0,tweet
0,introduction nlp
1,likely useful people
2,machine learning new electrcity
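
For the "build your own stop words file" option, one possible layout is a plain text file with one word per line. The sketch below assumes that layout; the file name `my_stopwords.txt`, its contents, and the helper `remove_stop_words` are all hypothetical and not from the book.

```python
import os
import tempfile

# Hypothetical stop word file: one word per line (contents are made up).
path = os.path.join(tempfile.mkdtemp(), 'my_stopwords.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('is\nto\nthe\nthis\n')

# Load the words into a set for fast membership checks, skipping blank lines.
with open(path, encoding='utf-8') as f:
    custom_stopwords = {line.strip() for line in f if line.strip()}

def remove_stop_words(text, stop_words):
    """Return text with every whitespace-separated token found in stop_words dropped."""
    return ' '.join(word for word in text.split() if word not in stop_words)

result = remove_stop_words('this is introduction to nlp', custom_stopwords)
print(result)
```

The same helper could be applied to the DataFrame column with `df['tweet'].apply(lambda t: remove_stop_words(t, custom_stopwords))`.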
