## Twitter Setiment Analysis 

### Part 2: TD - IDF Tuturial with sentiment140 dataset

### Load Cleaned Data

In [2]:
import load_data as ld
df = ld.run_processes()

In [3]:
df.head()

Unnamed: 0,target,text,tokenized,filtered,stemmed
0,0,is upset that he can't update his Facebook by ...,is upset that he cant update his facebook by t...,upset cant update his facebook texting might c...,upset cant updat hi facebook text might cri re...
1,0,@Kenichan I dived many times for the ball. Man...,i dived many times for the ball managed to sav...,i dived many times ball managed save 50 rest g...,i dive mani time ball manag save 50 rest go ou...
2,0,my whole body feels itchy and like its on fire,my whole body feels itchy and like its on fire,my whole body feels itchy like fire,my whole bodi feel itchi like fire
3,0,"@nationwideclass no, it's not behaving at all....",no its not behaving at all im mad why am i her...,no not behaving all im mad why am i here becau...,no not behav all im mad whi am i here becaus i...
4,0,@Kwesidei not the whole crew,not the whole crew,not whole crew,not whole crew


### Import ML pre-processing modules

In [4]:
import numpy as np
import pandas as pd

from sklearn.utils import random 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

### Drop NAs created during cleanup

We don't want to impute nor anything else since these are empty texts we cannot use for prediction.

In [5]:
dfm = df.dropna()
dfm.index = range(1,len(dfm) + 1)

### Vectorize with TF IDF

What does that mean?

Explain...

But first, random sample the 1.6M dataset.

#### Random Sample

In [81]:
def random_sample(df):
    """
    Sample 1% without replacement.
    """
    ix = random.sample_without_replacement(n_population=len(df),
                                           n_samples=round(len(df)/100),
                                           random_state=42)
    out = df.loc[ix, ]
    return out

In [126]:
# ensure equal amounts

# divide into negatives and positives
df0 = dfm[dfm['target'] == 0].copy()
df1 = dfm[dfm['target'] == 1].copy()
df1.index = range(0, len(df1))

# sample 1% from each and concatenate
df0_sample = random_sample(df0)
df1_sample = random_sample(df1)
df_sample = pd.concat([df0_sample, df1_sample])
df_sample.index = range(0, len(df_sample))

# counts grouped by target
df_sample.loc[:, ('target','text')].groupby(['target']).count()

Unnamed: 0_level_0,text
target,Unnamed: 1_level_1
0,7983
1,7979


#### TF IDF vector

In [140]:
# instantiate TF IDF vectorizer
vectorizer = TfidfVectorizer(sublinear_tf=True)

y is just an array with the target variable

In [127]:
y = np.array(df_sample.iloc[:, 0]).ravel()
y

array([0, 0, 0, ..., 1, 1, 1], dtype=int64)

In [89]:
# using tokenized feature 
X = vectorizer.fit_transform(np.array(df_sample.iloc[:, 2]).ravel())

In [128]:
# converting to dense format so we can visualize data
col = [i for i in vectorizer.get_feature_names()] 
temp = pd.DataFrame(X.todense(), columns=col) 
temp.shape

(15962, 19580)

In [125]:
# hunting down nonzero feature row
for i, e in enumerate(temp.loc[:, 'camo']):
    if e != 0:
        print(i, e)

13290 0.7416339690431502


In [118]:
# what that rare non-zero value and surrounding data looks like
temp.iloc[13285:13295, 3010:3020]

Unnamed: 0,camisado,cammy,camo,camp,campaign,campbell,camper,campers,campim,camping
13285,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13286,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13287,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13288,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13289,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13290,0.0,0.0,0.741634,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13291,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13292,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13293,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13294,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


**TfidfVectorizer**

Transforms text into feature vectors that can be used as an input to estimators. When the `fit()` method is called, it creates a dictionary that stores each term in the corpus and its assigned feature index. This dictionary is the vectorizer's `.vocabulary_`.

In [147]:
vectorizer.fit(np.array(df_sample.iloc[:, 1]).ravel())

TfidfVectorizer(sublinear_tf=True)

In [153]:
for ix, doc in enumerate(vectorizer.vocabulary_):
    if ix < 5 or ix > len(vectorizer.vocabulary_)-6:
        print(ix, doc)

0 up
1 way
2 too
3 early
4 in
25893 followando
25894 tbm
25895 intothestreet
25896 1bj
25897 twittts


As opposed to `Countvectorizer`, `TfidfVectorizer` doesn't simply one-hot encode each of these terms as features in a sparse matrix; rather, it assigns **scores** based on the $TF * IDF$ formula.

**Term Frequency (TF)**

TF is the frequecy of a term in a document (a word in a Tweet).

If the word is common (like "the") it appears with high frequency. From [Zipf's law](https://en.wikipedia.org/wiki/Zipf's_law) we learn that very frequent terms are uninformative in linguistics, 
these so-called "stop words" are often removed (as I did in the 'filtered' feature). Therefore we'd like to decrease the weight (or score) assigned to this word.

One problem with implementing TF alone is that rare words in a document may be uninformative in the context of entire corpus (all Tweets), so we want to balance the weight assigned to them in a document with another weight assigned via their frequency in all the documents (the corpus).

Enters...


**Inverse Document Frequency (IDF)**

IDF is 

log of {number of docs in your corpus divided by the number of docs in which this term appears}.