## Lab NLP


## Challenge 1 - Installations

In [7]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk import sent_tokenize, word_tokenize
from stop_words import get_stop_words
import re
import zipfile
import pandas as pd

In [8]:
text = 'Ironhack is a Global Tech School ranked num 2 worldwide.   Our mission is to help people transform their careers and join a thriving community of tech professionals that love what they do. This ideology is reflected in our teaching practices, which consist of a nine-weeks immersive programming, UX/UI design or Data Analytics course as well as a one-week hiring fair aimed at helping our students change their career and get a job straight after the course. We are present in 8 countries and have campuses in 9 locations - Madrid, Barcelona, Miami, Paris, Mexico City,  Berlin, Amsterdam, Sao Paulo and Lisbon.'



In [9]:
sent_tokenize(text)

['Ironhack is a Global Tech School ranked num 2 worldwide.',
 'Our mission is to help people transform their careers and join a thriving community of tech professionals that love what they do.',
 'This ideology is reflected in our teaching practices, which consist of a nine-weeks immersive programming, UX/UI design or Data Analytics course as well as a one-week hiring fair aimed at helping our students change their career and get a job straight after the course.',
 'We are present in 8 countries and have campuses in 9 locations - Madrid, Barcelona, Miami, Paris, Mexico City,  Berlin, Amsterdam, Sao Paulo and Lisbon.']

## Challenge 2 - Preparing Text Data For Analysis

In [10]:
ejemplo_str = "@Ironhack's-#Q website 776-is http://ironhack.com [(2018)]\")"
ejemplo_str

'@Ironhack\'s-#Q website 776-is http://ironhack.com [(2018)]")'

In [11]:
def clean_up(s):
    """
    Cleans up numbers, URLs, and special characters from a string.

    Args:
        s: The string to be cleaned up.

    Returns:
        A string that has been cleaned up.
    """
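    # Drop URLs; this pattern matches from "http:" up to the last ".com" on the line.
    s = re.sub(r'(?:http:.+\.com)',' ',s)
    # Replace digits, then any remaining non-word characters, with spaces.
    s = re.sub(r'\d',' ',s)
    s = re.sub(r'\W',' ',s)
    # Lowercase and trim the ends; inner runs of spaces are left for the tokenizer.
    s = s.lower().strip()
    return s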

In [12]:
clean_up(ejemplo_str)

'ironhack s  q website     is'

## Tokenization

In [13]:
def tokenize(s):
    """
    Tokenize a string.

    Args:
        s: String to be tokenized.

    Returns:
        A list of words as the result of tokenization.
    """
    return word_tokenize(s)

In [14]:
tokenize(clean_up(ejemplo_str))

['ironhack', 's', 'q', 'website', 'is']

## Stemming and Lemmatization

NLTK provides three stemmers: Porter, Snowball, and Lancaster. They differ in how aggressively they strip word endings. Porter is the gentlest and tends to preserve a word's original form when in doubt. Lancaster is the most aggressive and sometimes produces wrong output. Snowball sits in between. **In most cases you will use either Porter or Snowball**.
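
As a rough illustration of that difference, the sketch below runs the three stemmers over a few sample words. The word list is made up for this example and is not part of the lab data; the outputs are left for you to inspect.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# One instance per stemmer; Snowball needs the language name.
stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}

# Hypothetical sample words, chosen only for illustration.
sample_words = ["running", "generously", "maximum", "careers"]

# Map each word to the stem produced by each stemmer.
stems = {
    word: {name: stemmer.stem(word) for name, stemmer in stemmers.items()}
    for word in sample_words
}

for word, result in stems.items():
    print(word, result)
```

Comparing the three columns side by side is usually enough to decide which stemmer suits a given corpus.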


In [15]:
def stem_and_lemmatize(l):
    """
    Lemmatize a list of words, treating each word as a verb.

    Despite the name, no stemmer is applied; only WordNet lemmatization runs.

    Args:
        l: A list of strings.

    Returns:
        A list of lemmatized strings.
    """
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word, pos='v') for word in l]

In [16]:
stem_and_lemmatize(tokenize(clean_up(ejemplo_str)))

['ironhack', 's', 'q', 'website', 'be']


## Stop Words Removal

Stop words are the most commonly used words in a language that contribute little to the meaning of a text. Examples of English stop words are i, me, is, and, the, but, and here. We remove stop words from the analysis because otherwise they would make up an overwhelming share of the tokenized word list, and NLP algorithms would have trouble identifying the truly important words.

NLTK has a stopwords package that allows us to import the most common stop words in over a dozen languages, including English, Spanish, French, German, Dutch, Portuguese, and Italian. These are the bare minimum stop words (100-150 words in each language) that can get beginners started. Some other NLP packages, such as stop-words and wordcloud, provide bigger lists of stop words.

Now in your Jupyter Notebook, create a function called remove_stopwords that loops through a list of words that have been stemmed and lemmatized and removes the stop words. Return a new list without the stop words.


In [17]:
stop_words = get_stop_words('en')

In [18]:
def remove_stopwords(l,stop_words):
    """
    Remove English stopwords and single-character tokens from a list of strings.

    Args:
        l: A list of strings.
        stop_words: A collection of stop words to filter out.

    Returns:
        A list of strings after stop words are removed.
    """
    l = [word for word in l if len(word)>1]
    return [word for word in l if word not in stop_words]

In [19]:
remove_stopwords(stem_and_lemmatize(tokenize(clean_up(ejemplo_str))),stop_words)

['ironhack', 'website']

## Challenge 3: Sentiment Analysis

In [None]:
data_tweets = pd.read_csv('Sentiment140.csv',nrows=10000)

In [21]:
data_tweets.head(3)

| | target | id | date | flag | user | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |


In [22]:
# Process data.
def pre_processing(text,stop_words = stop_words):
    # Apply the cleaning steps: clean up, tokenize, lemmatize, remove stop words.
    text_clean = remove_stopwords(stem_and_lemmatize(tokenize(clean_up(text))),stop_words)
    return text_clean

In [23]:
pre_processing(data_tweets.text[0],stop_words)

['switchfoot',
 'zl',
 'awww',
 'bummer',
 'shoulda',
 'get',
 'david',
 'carr',
 'third',
 'day']

In [24]:
data_tweets['text_processed'] = data_tweets['text'].apply(pre_processing)

In [25]:
data_tweets.head(3)

| | target | id | date | flag | user | text | text_processed |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... | [switchfoot, zl, awww, bummer, shoulda, get, d... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... | [upset, can, update, facebook, texting, might,... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... | [kenichan, dive, many, time, ball, manage, sav... |



## Creating Bag of Words

The purpose of this step is to create a bag of words from the processed data. The bag of words contains every unique word in the whole text body (a.k.a. the corpus) together with its number of occurrences. It shows which words are the most important features across the corpus.

As you can imagine, the resulting set of words is massive. The less important words (i.e. those with very few occurrences) contribute little to the sentiment, so you only need the most important words to build your feature set in the next step. In our case, we will use the 5,000 most frequent words to build the features.

In your Jupyter Notebook, combine all the words in text_processed and calculate the frequency distribution of all words. A convenient tool for computing the term frequency distribution is NLTK's FreqDist class. Then select the top 5,000 words from the frequency distribution.
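
A minimal sketch of that step with FreqDist is shown here. It assumes a hypothetical three-tweet toy corpus in place of the text_processed column and keeps the top 2 words instead of the top 5,000:

```python
from itertools import chain

from nltk import FreqDist

# Hypothetical stand-in for data_tweets['text_processed']: one token list per tweet.
toy_processed = [
    ["get", "go", "work"],
    ["get", "go"],
    ["get"],
]

# Flatten the per-tweet lists and count every word in a single pass.
freq_dist = FreqDist(chain.from_iterable(toy_processed))

# Keep only the most frequent words; the lab uses 5,000 here.
top_n = 2
top_words = [word for word, count in freq_dist.most_common(top_n)]

print(freq_dist.most_common(top_n))
print(top_words)
```

Because FreqDist counts in one pass over the corpus, this avoids calling list.count once per unique word.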


In [26]:
all_words = []
for palabras in data_tweets.text_processed:
    # extend() appends in place instead of rebuilding the list on every tweet.
    all_words.extend(palabras)

In [37]:
uniques = list(set(all_words))
freq_words = []
for unique in uniques:
    freq_words.append([unique,all_words.count(unique)])

In [39]:
freq_words_df = pd.DataFrame(freq_words,columns=['Palabra','Frecuencia'])
freq_words_df = freq_words_df.sort_values('Frecuencia',ascending=False).head(5000)

In [41]:
freq_words_df.head(3)

| | Palabra | Frecuencia |
|---|---|---|
| 2961 | get | 1268 |
| 3886 | go | 1177 |
| 8626 | work | 923 |


In [47]:
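def find_features(document):
    # Build a {word: present?} dict over the top words, plus a boolean label.
    words = set(document.text_processed)
    features = {}
    for w in freq_words_df['Palabra']:
        features[w] = (w in words)
    # Sentiment140 encodes polarity as 0 (negative) and 4 (positive), not 1.
    target = bool(document.target == 4)
    return (features,target)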

In [55]:
features = list(data_tweets.apply(find_features,axis=1))


## Testing Naïve Bayes Model

Now we'll test our classifier with the test dataset. This is done by calling nltk.classify.accuracy(classifier, test).

As mentioned in one of the tutorial videos, a Naive Bayes model is considered OK if your accuracy score is over 0.6. If your accuracy score is over 0.7, you've done a great job!
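
Those thresholds only mean something relative to what a trivial model would score. As a hedged sketch, the helper below (majority_baseline, a name invented for this example) computes the accuracy of always predicting the most common label, using a made-up test set in the (features, label) shape that NLTK expects:

```python
from collections import Counter


def majority_baseline(labeled_featuresets):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(label for _, label in labeled_featuresets)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labeled_featuresets)


# Hypothetical test set: three positive examples and one negative.
toy_test_set = [
    ({"good": True}, True),
    ({"good": True}, True),
    ({"good": False}, True),
    ({"good": False}, False),
]

baseline = majority_baseline(toy_test_set)
print(baseline)
```

If the baseline is already close to the classifier's accuracy, the labels are too unbalanced for the score to say much about the model.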


In [56]:
import random
random.shuffle(features)

In [59]:
train_set, test_set = features[:4000], features[4000:]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [61]:
nltk.classify.accuracy(classifier, test_set)

1.0

An accuracy of exactly 1.0 is a warning sign rather than a great result. Sentiment140 encodes polarity as 0 (negative) and 4 (positive), and the original file is sorted by polarity, so the first 10,000 rows loaded with nrows=10000 most likely contain only negative tweets. When every example carries the same label, any classifier scores 1.0 without learning anything. Draw a random sample from the whole file before reading anything into this number.


# Bonus Question 1 & 2: Improve Model Performance & Machine Learning Pipeline

If you still have energy left and want to dig deeper, try to improve your classifier's performance. There are many aspects you can look into, for example:

Improve stemming and lemmatization. Inspect your bag of words and the most important features. Are there any words you should further remove from the analysis? You can append these words to the stop words list.

Remember that we only used the top 5,000 features to build the model? Try different numbers of top features. The goal is to use as few features as you can without compromising model performance. The fewer features you select, the faster your model trains, and you can then use a larger sample size to improve your accuracy score.

In a new Jupyter Notebook, combine all your code into a function (or a class). Your new function will execute the complete machine learning pipeline: it receives the dataset location and outputs the classifier. **This will allow you to use your function to predict the sentiment of any tweet in real time**.
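
One possible shape for such a function is sketched below. It is an outline under stated assumptions, not a reference solution: simple_tokens is a hypothetical regex-based stand-in for the notebook's pre_processing (so the example runs without NLTK data downloads), train_from_frame and train_from_csv are names invented here, the CSV is assumed to have text and target columns, and a target of 4 is assumed to mark a positive tweet.

```python
import random
import re

import nltk
import pandas as pd


def simple_tokens(text, stop_words=frozenset()):
    """Hypothetical stand-in for the notebook's pre_processing function."""
    words = re.findall(r"[a-z]{2,}", text.lower())
    return [word for word in words if word not in stop_words]


def train_from_frame(df, n_features=5000, test_fraction=0.2,
                     stop_words=frozenset(), seed=0):
    """Train a Naive Bayes classifier from 'text' and 'target' columns."""
    token_lists = [simple_tokens(text, stop_words) for text in df["text"]]

    # Bag of words: keep the n_features most frequent words as the vocabulary.
    freq_dist = nltk.FreqDist(word for tokens in token_lists for word in tokens)
    vocabulary = [word for word, _ in freq_dist.most_common(n_features)]

    # One (features, label) pair per tweet; assumes 4 marks a positive tweet.
    labeled = []
    for tokens, target in zip(token_lists, df["target"]):
        present = set(tokens)
        features = {word: (word in present) for word in vocabulary}
        labeled.append((features, int(target) == 4))

    # Shuffle before splitting so both sets mix the classes.
    random.Random(seed).shuffle(labeled)
    split = int(len(labeled) * (1 - test_fraction))
    train_set, test_set = labeled[:split], labeled[split:]

    classifier = nltk.NaiveBayesClassifier.train(train_set)
    accuracy = nltk.classify.accuracy(classifier, test_set)
    return classifier, accuracy, vocabulary


def train_from_csv(path, **kwargs):
    """Read the dataset from its location, then run the same pipeline."""
    return train_from_frame(pd.read_csv(path), **kwargs)


# Made-up toy data so the sketch runs without the Sentiment140 file.
toy = pd.DataFrame({
    "text": ["I love this great day", "I hate this awful day"] * 5,
    "target": [4, 0] * 5,
})
classifier, accuracy, vocabulary = train_from_frame(toy)
print(accuracy)
```

To predict a new tweet, build its feature dict over the returned vocabulary in the same way and pass it to classifier.classify. The n_features and stop_words parameters are there so the experiments suggested above can be run without editing the function.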
