# Building a word-frequency dictionary

Here we build a word-frequency dictionary. A word frequency is the number of times a word appears in a text. The final result is stored in a dictionary keyed by word and sentiment label. The dataset comes from the `twitter_samples` corpus of the [NLTK](https://www.nltk.org/) library, which contains both positive and negative tweets.
To find the frequencies, we follow the steps below:

## Plan:
1. Create the labels: 1 for each positive tweet and 0 for each negative tweet, so that the label array has the same length as the list of tweets.
2. Process the tweets with the process function, which removes unwanted characters, tokenizes the text, and stems the words.
3. Join the list of tokens of each tweet back into a single string.
4. Count how often each word occurs in positive tweets (label 1) and in negative tweets (label 0).
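
Before working with the tweets, the basic idea of a word-frequency dictionary can be sketched on a toy sentence. This is only an illustration: the sentence is made up, and `collections.Counter` from the standard library stands in for the dictionary that is built by hand later on.

```python
from collections import Counter

# Toy text; a stand-in for real tweets (an assumption for illustration).
text = "happy happy day sad day happy"

# Counter maps each word to the number of times it occurs.
word_counts = Counter(text.split())

print(word_counts["happy"])  # 3
print(word_counts["day"])    # 2
print(word_counts["rainy"])  # 0, a word that never occurs counts as zero
```

The rest of the notebook does the same thing, except that the key is a `(word, label)` pair so that positive and negative occurrences are counted separately.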

 

First, we import the libraries:

In [2]:
import nltk
# Run nltk.download('twitter_samples') once if the corpus is not installed yet.
from nltk.corpus import twitter_samples
import matplotlib.pyplot as plt
import numpy as np

### Import the dataset

In [6]:
pos_twt = twitter_samples.strings('positive_tweets.json')
neg_twt = twitter_samples.strings('negative_tweets.json')

twts = pos_twt+neg_twt

In [13]:
type(twts)

list

## Step: 1. Create the labels


In [30]:
labels = np.append( np.ones(len(pos_twt)), np.zeros(len(neg_twt)) )

In [32]:
labels.shape

(10000,)
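
The labels only make sense because they are built in the same order as `twts` (positives first, then negatives). The sketch below shows that alignment on made-up data with plain Python lists; the names `pos_toy`, `neg_toy`, `toy_twts`, and `toy_labels` are hypothetical, and the notebook itself uses a NumPy array instead of a list.

```python
# Toy stand-ins for pos_twt and neg_twt (assumed data, not from the corpus).
pos_toy = ["great day :)", "love it :)"]
neg_toy = ["so sad :(", "bad day :(", "miss you :("]

# Same construction as in the notebook: positives first, then negatives.
toy_twts = pos_toy + neg_toy
toy_labels = [1] * len(pos_toy) + [0] * len(neg_toy)

# zip pairs each tweet with its label, position by position.
for label, tweet in zip(toy_labels, toy_twts):
    print(label, tweet)
```

If the two lists were concatenated in a different order than the labels, every count in the later steps would be attributed to the wrong class.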

## Step: 2
Before counting word frequencies we need to process the tweets. The `process_tweet` function below does this: it removes unwanted characters, tokenizes the text, drops stop words and punctuation, and stems the remaining words. The processed tweets are stored in the ***processed_twts*** list.

In [21]:
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer
import string
nltk.download('stopwords')
stop_wrds_en = stopwords.words('english')

def process_tweet(tweet):
    """Clean, tokenize, and stem one tweet; return a list of stemmed tokens."""
    # Instantiate the stemmer
    stemmer = PorterStemmer()

    # Remove the retweet marker, URLs, '#' and '@' signs, and stock tickers
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    tweet = re.sub(r'#', '', tweet)
    tweet = re.sub(r'@', '', tweet)
    tweet = re.sub(r'\$\w*', '', tweet)

    # Lowercase and tokenize; reduce_len shortens long runs of repeated characters
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    twt_tokened = tokenizer.tokenize(tweet)

    # Drop stop words and punctuation
    clean_twt = []
    for word in twt_tokened:
        if word not in stop_wrds_en and word not in string.punctuation:
            clean_twt.append(word)

    # Stem the remaining tokens
    stemed_twt = []
    for word in clean_twt:
        stemed_twt.append(stemmer.stem(word))

    return stemed_twt

[nltk_data] Downloading package stopwords to /home/mn/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
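
To see what the regular expressions in `process_tweet` do before tokenizing, the sketch below applies the same patterns in the same order to a made-up tweet, using only the standard `re` module. The helper name `strip_markup` and the sample string are assumptions for illustration.

```python
import re

def strip_markup(tweet):
    """Apply the regex substitutions from process_tweet, in the same order."""
    tweet = re.sub(r'^RT[\s]+', '', tweet)              # leading retweet marker
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)  # URL and the rest of that line
    tweet = re.sub(r'#', '', tweet)                     # hash sign only, the tag text stays
    tweet = re.sub(r'@', '', tweet)                     # at sign only, the handle text stays
    tweet = re.sub(r'\$\w*', '', tweet)                 # stock tickers such as $GE
    return tweet

sample = "RT @user: great day #sunny https://t.co/abc :)"
print(repr(strip_markup(sample)))  # 'user: great day sunny '
```

Two things are worth noticing. The URL pattern ends in `.*`, so everything after the URL on the same line is removed too, including the `:)` here. And because the `@` sign is stripped before tokenizing, the tokenizer's `strip_handles=True` option probably has nothing left to recognize, which would explain why handle names such as `france_int` survive in the processed tweets.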


In [None]:
processed_twts = []

for tweet in twts:
    processed_twts.append(process_tweet(tweet))

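Let's look at the last and the first processed tweets to check the result of the processing. (The build-frequency function, which counts the number of times each word occurs in positive and negative tweets, follows in Step 4.)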

In [52]:
print('The last processed tweet\n', processed_twts[-1])

print('The first processed tweet\n', processed_twts[0])

The last processed tweet
 ['eawoman', 'hull', 'support', 'expect', 'misser', 'week', ':-(']
The first processed tweet
 ['followfriday', 'france_int', 'pkuchli', '57', 'milipol_pari', 'top', 'engag', 'member', 'commun', 'week', ':)']


## Step: 3
Each processed tweet is a list of tokens, so we join the tokens again to get one string per tweet. ***processed_twts2*** holds the joined string of each tweet.

In [62]:
processed_twts2 = []

for lst_tokens in processed_twts:
    processed_twts2.append( ' '.join(lst_tokens) )

In [65]:
processed_twts2[1]

'lamb 2ja hey jame odd :/ pleas call contact centr 02392441234 abl assist :) mani thank'

## Step: 4
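Find the frequency of words in + and - tweets. The result is stored in the ***freqs*** dictionary, whose keys are `(word, label)` pairs. Note that `build_freqs` calls `process_tweet` itself, so passing the already processed ***processed_twts2*** stems every token a second time; this appears to be why `pleas` from Step 3 shows up as `plea` in the output below.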

In [68]:
def build_freqs(tweets, labels):
    ylist = np.squeeze(labels).tolist()
    
    freqs = {}
    
    for y, tweet in zip(ylist, tweets):
        for word in process_tweet(tweet):
            pair = (word, y)
            
            if pair in freqs:
                freqs[pair] += 1
            else:
                freqs[pair] = 1
    return freqs
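
When the tweets are already tokenized, the same counting can be done without processing them again. The following is a minimal sketch of that alternative, not the notebook's method: the name `build_freqs_from_tokens` and the toy data are hypothetical, and `collections.Counter` replaces the explicit `if`/`else`.

```python
from collections import Counter

def build_freqs_from_tokens(token_lists, labels):
    """Count (word, label) pairs for tweets that are already tokenized."""
    freqs = Counter()
    for y, tokens in zip(labels, token_lists):
        for word in tokens:
            # int(y) turns a float label such as 1.0 into 1
            freqs[(word, int(y))] += 1
    return freqs

# Toy tokens and labels (assumed data for illustration).
toy_tokens = [["happi", "day", ":)"], ["sad", "day", ":("], ["happi", ":)"]]
toy_labels = [1.0, 0.0, 1.0]

toy_freqs = build_freqs_from_tokens(toy_tokens, toy_labels)
print(toy_freqs[("happi", 1)])  # 2
print(toy_freqs[("day", 0)])    # 1
print(toy_freqs[("sad", 1)])    # 0
```

With the notebook's variables, such a function would take the token lists in `processed_twts` rather than the joined strings, which would skip Step 3 and avoid stemming the tokens twice.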

In [69]:
freqs = build_freqs(processed_twts2, labels)

Let's look at the frequencies of the words in + and - tweets:

In [73]:
freqs

{('followfriday', 1.0): 25,
 ('france_int', 1.0): 1,
 ('pkuchli', 1.0): 1,
 ('57', 1.0): 2,
 ('milipol_pari', 1.0): 1,
 ('top', 1.0): 32,
 ('engag', 1.0): 7,
 ('member', 1.0): 16,
 ('commun', 1.0): 33,
 ('week', 1.0): 84,
 (':)', 1.0): 3568,
 ('lamb', 1.0): 1,
 ('2ja', 1.0): 1,
 ('hey', 1.0): 76,
 ('jame', 1.0): 7,
 ('odd', 1.0): 2,
 (':/', 1.0): 5,
 ('plea', 1.0): 97,
 ('call', 1.0): 37,
 ('contact', 1.0): 7,
 ('centr', 1.0): 2,
 ('02392441234', 1.0): 1,
 ('abl', 1.0): 8,
 ('assist', 1.0): 1,
 ('mani', 1.0): 33,
 ('thank', 1.0): 621,
 ('despiteoffici', 1.0): 1,
 ('listen', 1.0): 16,
 ('last', 1.0): 47,
 ('night', 1.0): 70,
 ('bleed', 1.0): 2,
 ('amaz', 1.0): 51,
 ('track', 1.0): 5,
 ('scotland', 1.0): 2,
 ('97side', 1.0): 1,
 ('congrat', 1.0): 21,
 ('yeaaah', 1.0): 1,
 ('yipppi', 1.0): 1,
 ('accnt', 1.0): 2,
 ('verifi', 1.0): 2,
 ('rqst', 1.0): 1,
 ('succeed', 1.0): 1,
 ('got', 1.0): 69,
 ('blue', 1.0): 9,
 ('tick', 1.0): 1,
 ('mark', 1.0): 2,
 ('fb', 1.0): 6,
 ('profil', 1.0): 2,
 ...

Some words appear in both + and - tweets. As an example, see the counts for the stem ***happi*** below:

In [79]:
print('Number of times the word happy appears in positive tweets:',
      freqs.get(('happi', 1)))
print('Number of times the word happy appears in negative tweets:',
      freqs.get(('happi', 0)))

Number of times the word happy appears in positive tweets: 211
Number of times the word happy appears in negative tweets: 25
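
The keys in `freqs` were stored with float labels (`1.0` and `0.0`), yet the lookup above uses the integers `1` and `0`. This works because Python treats `1` and `1.0` as equal and gives them the same hash, so the tuples are the same dictionary key. A small sketch, using a two-entry dictionary with the counts copied from the output above:

```python
# Keys stored with float labels, as produced by np.ones / np.zeros.
toy_freqs = {("happi", 1.0): 211, ("happi", 0.0): 25}

# 1 == 1.0 and both hash to the same value, so the tuples are the same key.
print(hash(1) == hash(1.0))            # True
print(("happi", 1) == ("happi", 1.0))  # True
print(toy_freqs.get(("happi", 1)))     # 211
print(toy_freqs.get(("happi", 0)))     # 25
print(toy_freqs.get(("happi", 2)))     # None, no such key
```

Note that `dict.get` returns `None` for a missing pair unless a default such as `0` is passed as the second argument.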


### Piece of Mind
Here the goal was to count the frequency of words in positive and negative tweets. Given a tweet, how can we find out how many times each of its words appeared in + tweets and in - tweets?

The answer is simple: we process the tweet and look up each of its words in the frequency dictionary built in Step 4.

As an example, ***sample_tweet*** is taken and processed, and the + and - frequencies are found for each of its words.

In [94]:
sample_tweet = twts[999]
print('The sample tweet is:\n', sample_tweet)

# processed tweet
proc_twt = process_tweet(sample_tweet)
print('\nThe processed tweet is:\n', proc_twt)

dic_words = []

for word in proc_twt:

    pos = 0
    neg = 0

    # frequency of the word in positive tweets
    if (word, 1) in freqs:
        pos = freqs[(word, 1)]

    # frequency of the word in negative tweets
    if (word, 0) in freqs:
        neg = freqs[(word, 0)]

    # dic_words collects [word, positive count, negative count] for each word
    dic_words.append([word, pos, neg])

The sample tweet is:
 *sigh* "@Whykaysbeauty: Bruhhh“@Dopjones: Call me daddy one more time :)”"

The processed tweet is:
 ['sigh', 'whykaysbeauti', 'bruhhh', '“', 'dopjon', 'call', 'daddi', 'one', 'time', ':)', '”']


In [96]:
print('\nFrequency of each word of the tweet in positive and negative tweets:')
dic_words


Frequency of each word of the tweet in positive and negative tweets:


[['sigh', 3, 13],
 ['whykaysbeauti', 1, 0],
 ['bruhhh', 1, 0],
 ['“', 7, 15],
 ['dopjon', 1, 0],
 ['call', 37, 29],
 ['daddi', 2, 4],
 ['one', 129, 150],
 ['time', 127, 166],
 [':)', 3568, 2],
 ['”', 5, 11]]
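
The lookup loop above can also be wrapped in a small reusable function. The sketch below is one possible way to do that, assuming a `freqs` dictionary keyed by `(word, label)` as built in Step 4; the name `lookup_counts` and the toy dictionary are hypothetical, with counts copied from the output above.

```python
def lookup_counts(tokens, freqs):
    """Return [word, positive_count, negative_count] for each token."""
    # dict.get with a default of 0 covers words that never occurred with that label
    return [[word, freqs.get((word, 1), 0), freqs.get((word, 0), 0)]
            for word in tokens]

# Toy frequency dictionary (a small excerpt, for illustration only).
toy_freqs = {("call", 1): 37, ("call", 0): 29, (":)", 1): 3568, (":)", 0): 2}

rows = lookup_counts(["call", ":)", "unseen"], toy_freqs)
print(rows)
# [['call', 37, 29], [':)', 3568, 2], ['unseen', 0, 0]]
```

Passing `0` as the default to `get` replaces the two `if` checks, and a word that was never seen simply gets a count of zero for both labels.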