### Twitter Sentiment Analysis - TFIDF

The previous IPython notebook analysed the `train_full` dataset, which contains only the raw tweets. In this notebook we try to improve predictive power by adding a second representation of each tweet: the *term frequency-inverse document frequency* weighting used in Natural Language Processing, or `tfidf` for short.
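
As a reminder of what the metric measures, here is a minimal sketch of one textbook variant of `tfidf` on a toy corpus. The function name `tfidf` and the example documents are made up for illustration, and the exact weighting and normalisation used to build the provided `tfidf` files is not stated here, so this sketch will not necessarily reproduce those values.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Textbook variant: (term count / doc length) * log(N / document frequency)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears at least once
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical, already tokenised tweets
docs = [["good", "day"], ["bad", "day"], ["good", "good", "movie"]]
weights = tfidf(docs)
```

A term that is frequent in one tweet but rare across the corpus (such as "movie" above) receives a higher weight than a term that appears in most tweets.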

We follow the same dataset conventions as before, with each dataset named `{set}_{type}` (for example, `train_full`):
- `{set}` is one of train, dev, or test:
    - train: use this data when building a model
    - dev: use this data when evaluating a model
    - test: submit the outputs on this data to Kaggle; the labels are not given
- `{type}` is one of full, count, tfidf, or glove100:
    - full: the raw text of the corresponding tweets, one tweet per line, in the format `sentiment, tweet_id, tweet` followed by a newline

where sentiment is the class label (to be predicted), tweet_id identifies the tweet, and tweet is the raw tweet text.

In [55]:
import pandas as pd
import numpy as np

import ast
from typing import List

In [56]:
df_train = pd.read_csv("train_tfidf.csv")
print(f"{df_train.shape[0]} training instances")
df_train.head()

159253 training instances


Unnamed: 0,sentiment,tweet_id,tweet
0,neg,1,"[(3083, 0.4135918197208131), (3245, 0.79102943..."
1,neg,2,"[(679, 0.4192120119709425), (1513, 0.523940563..."
2,neg,3,"[(225, 0.5013098541806313), (1480, 0.441928325..."
3,neg,4,"[(1748, 0.5306751425467238), (1811, 0.34289257..."
4,neg,5,"[(1788, 0.568230353269611), (1789, 0.403924230..."


In [61]:
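# df_train['tweet'] is currently a string, so we first parse it into a list of tuples;
# that list can later be converted to a sparse matrix
def string_to_array(tfidf_tweet: str) -> List[tuple]:
    try:
        # Safely evaluate the string literal, e.g. "[(3083, 0.41), ...]"
        return ast.literal_eval(tfidf_tweet)
    except (ValueError, SyntaxError):
        # Malformed entries are treated as tweets with no terms
        return []


# apply() returns a new Series, so assign it back to the column
df_train['tweet'] = df_train['tweet'].apply(string_to_array)
df_train.head()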

Unnamed: 0,sentiment,tweet_id,tweet
0,neg,1,"[(3083, 0.4135918197208131), (3245, 0.79102943..."
1,neg,2,"[(679, 0.4192120119709425), (1513, 0.523940563..."
2,neg,3,"[(225, 0.5013098541806313), (1480, 0.441928325..."
3,neg,4,"[(1748, 0.5306751425467238), (1811, 0.34289257..."
4,neg,5,"[(1788, 0.568230353269611), (1789, 0.403924230..."
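
The second step mentioned in the cell above, going from parsed tuples to a sparse matrix, could look roughly like the sketch below. It assumes that the first element of each tuple is a column (term) index and the second is its weight, and that SciPy is available. The helper name `to_sparse` and the `n_features` argument (the vocabulary size, which is not given here) are hypothetical.

```python
import ast
from typing import List, Tuple

from scipy.sparse import csr_matrix


def to_sparse(rows: List[List[Tuple[int, float]]], n_features: int) -> csr_matrix:
    """Build a CSR matrix with one row per tweet from (column index, weight) pairs."""
    data, indices, indptr = [], [], [0]
    for row in rows:
        for col, weight in row:
            indices.append(col)
            data.append(weight)
        # indptr marks where each row ends inside data/indices
        indptr.append(len(indices))
    return csr_matrix((data, indices, indptr), shape=(len(rows), n_features))


# Toy strings in the same shape as the `tweet` column (values are made up)
raw = ["[(0, 0.25), (3, 0.5)]", "[(1, 1.0)]"]
rows = [ast.literal_eval(s) for s in raw]
X = to_sparse(rows, n_features=5)  # n_features=5 is an assumption for this toy data
```

Passing `n_features` explicitly, rather than inferring it from the largest index seen, keeps the train, dev, and test matrices the same width.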


### Feature Engineering:

We can combine this `tfidf` representation of the tweet with features engineered from the raw tweet itself, such as tweet length, number of mentions, etc.
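
A minimal sketch of two such features, computed with pandas on toy data. The frame `df_full`, the helper `add_raw_features`, and the column names `tweet_length` and `n_mentions` are illustrative assumptions, and a mention is assumed to be an `@` followed by word characters.

```python
import pandas as pd

# Toy raw tweets standing in for the `tweet` column of the full dataset
df_full = pd.DataFrame({
    "tweet_id": [1, 2],
    "tweet": ["@alice hello @bob", "no mentions here"],
})


def add_raw_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with simple hand-crafted features added."""
    out = df.copy()
    # Number of characters in the raw tweet
    out["tweet_length"] = out["tweet"].str.len()
    # Number of @mentions, matched as '@' followed by word characters
    out["n_mentions"] = out["tweet"].str.count(r"@\w+")
    return out


features = add_raw_features(df_full)
```

Because both datasets carry a `tweet_id`, features like these could be aligned with the `tfidf` rows on that key and then stacked next to the sparse matrix, for example with `scipy.sparse.hstack`.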