<h1>Term Frequency-Inverse Document Frequency (TF-IDF) Implementation</h1>

In this notebook we build a TF-IDF vectorizer with scikit-learn and feed its output to two classifiers to get predictions: a Multinomial Naive Bayes and a linear Support Vector Machine.

In [None]:
# General imports
import pickle
import pandas as pd
import numpy as np

# Sklearn libraries for TF-IDF and specific classifiers (Bayes and SVM)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

First we load the preprocessed DataFrames that were pickled earlier.

In [None]:
with open('outputs/train_neg_proc.pkl', 'rb') as f:
    neg_DF = pickle.load(f)
    
with open('outputs/train_pos_proc.pkl', 'rb') as f:
    pos_DF = pickle.load(f)
    
with open('outputs/test_data_proc.pkl', 'rb') as f:
    test_DF = pickle.load(f)

In [None]:
neg_DF.head()

<h2>TF-IDF</h2>

Now we create the vectorizer. We ignore words that appear in fewer than 5 tweets (`min_df=5`) or in more than 80% of the tweets (`max_df=0.8`). With `sublinear_tf=True`, the raw term count tf is replaced by 1 + log(tf).

In [None]:
# create the vectorizer
vectorizer = TfidfVectorizer(min_df=5, max_df=0.8, sublinear_tf=True, use_idf=True)
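
As a rough illustration of how the two document-frequency cutoffs behave, the sketch below fits a vectorizer on a tiny made-up corpus. The corpus, the variable names (`toy_docs`, `toy_vectorizer`, `kept_terms`) and the smaller `min_df=2` threshold are assumptions chosen so the effect is visible on five documents; they are not part of the tweet pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus (not the tweet data)
toy_docs = [
    "good movie good plot",
    "good movie bad plot",
    "good acting",
    "good ending rare",
    "good music",
]

# Keep terms that appear in at least 2 documents and in at most 80% of them
toy_vectorizer = TfidfVectorizer(min_df=2, max_df=0.8)
toy_vectorizer.fit(toy_docs)

# The learned vocabulary only contains the terms that survived both cutoffs
kept_terms = sorted(toy_vectorizer.vocabulary_)
print(kept_terms)
```

Here "good" appears in all five documents (more than 80%) and is dropped by `max_df`, while words that occur in a single document are dropped by `min_df`, so only "movie" and "plot" should remain.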

We now need to build the corpus. The training set is the negative and positive sets concatenated, and the test set is the unlabeled part.

To do this, we concatenate the negative and positive DataFrames, then create a vector of labels for them.

In [None]:
# keep only the "lemmed" column and join each list of words into one space-separated string
neg_DF = neg_DF[["lemmed"]].copy()
pos_DF = pos_DF[["lemmed"]].copy()
test_DF = test_DF[["lemmed"]].copy()
neg_DF["lemmed"] = neg_DF["lemmed"].apply(' '.join)
pos_DF["lemmed"] = pos_DF["lemmed"].apply(' '.join)
test_DF["lemmed"] = test_DF["lemmed"].apply(' '.join)
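
The join step can be sketched on a toy DataFrame, assuming that each `lemmed` entry is a list of tokens (which is what joining with `' '.join` suggests). The name `toy_df` and its contents are made up for illustration.

```python
import pandas as pd

# Hypothetical example: each row holds a list of lemmatized tokens
toy_df = pd.DataFrame({"lemmed": [["i", "love", "this"], ["not", "good"]]})

# Join each token list into one space-separated string,
# which is the input format TfidfVectorizer expects by default
toy_df["lemmed"] = toy_df["lemmed"].apply(" ".join)
joined = toy_df["lemmed"].tolist()
print(joined)
```

The vectorizer then tokenizes these strings itself, so the join is only a change of representation.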

In [None]:
# negative tweets come first, so the first len(neg_DF) rows get label -1 and the rest get label 1
all_labeled_DF = pd.concat([neg_DF, pos_DF])

In [None]:
# we create the labels: -1 for the negative tweets, 1 for the positive ones
negs = len(neg_DF.index)
poss = len(pos_DF.index)
labels = np.zeros(negs + poss)
labels[:negs] = -1
labels[negs:] = 1
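
For reference, the same label layout can be built in one expression. This is only an equivalent sketch on made-up sizes (`n_neg` and `n_pos` are hypothetical), showing that the labels follow the concatenation order: negatives first, then positives.

```python
import numpy as np

# Hypothetical sizes standing in for len(neg_DF) and len(pos_DF)
n_neg, n_pos = 3, 2

# -1 for every negative example, followed by 1 for every positive example
toy_labels = np.concatenate([-np.ones(n_neg), np.ones(n_pos)])
print(toy_labels)
```

Both approaches produce a float array, which is why the predictions come back as floats later on.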

In [None]:
# we fit our TF-IDF vectorizer on the training set and transform it in one step
train_corpus_tf_idf = vectorizer.fit_transform(all_labeled_DF['lemmed'])
# we only transform the test set, reusing the vocabulary and IDF weights learned on the training set
test_corpus_tf_idf = vectorizer.transform(test_DF["lemmed"])
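
The difference between `fit_transform` and `transform` can be sketched on a toy example. The corpora and names below (`toy_train`, `toy_test`, `toy_vec`) are made up; the point is that the vocabulary comes from the training data only, so both matrices share the same columns and unseen test words are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data: the test document contains a word never seen in training
toy_train = ["happy day", "sad day"]
toy_test = ["happy unknownword"]

toy_vec = TfidfVectorizer()
X_train = toy_vec.fit_transform(toy_train)  # learns vocabulary and IDF weights
X_test = toy_vec.transform(toy_test)        # reuses them, learns nothing new

print(sorted(toy_vec.vocabulary_))
print(X_train.shape, X_test.shape)
```

Calling `fit_transform` on the test set instead would build a different vocabulary, and the classifiers trained on the training features could no longer be applied to it.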

Now that this is done, it is time to test out this vectorizer on the two classifiers mentioned above.

In [None]:
# we create both models
model1 = LinearSVC()  # linear SVM
model2 = MultinomialNB()  # Multinomial Naive Bayes

In [None]:
# train both models on the TF-IDF features of the training set
model1.fit(train_corpus_tf_idf, labels)
model2.fit(train_corpus_tf_idf, labels)

In [None]:
# predictions
result1 = model1.predict(test_corpus_tf_idf)
result2 = model2.predict(test_corpus_tf_idf)
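
A minimal end-to-end sketch of the same fit and predict steps on a toy corpus is given below. The sentences, the labels and all `toy_*` names are invented for illustration. It mainly shows that, with float labels like the ones built with `np.zeros`, both classifiers return float predictions taken from {-1, 1}.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the tweets
toy_train = ["great happy love", "love great day", "awful sad hate", "hate awful day"]
toy_labels = np.array([1.0, 1.0, -1.0, -1.0])  # float labels, as in the notebook
toy_test = ["happy love", "sad hate"]

toy_vec = TfidfVectorizer()
X_train = toy_vec.fit_transform(toy_train)
X_test = toy_vec.transform(toy_test)

# Fit each classifier on the same features and collect its predictions
toy_preds = {}
for toy_model in (LinearSVC(), MultinomialNB()):
    toy_model.fit(X_train, toy_labels)
    toy_preds[type(toy_model).__name__] = toy_model.predict(X_test)

for name, preds in toy_preds.items():
    print(name, preds.dtype, preds)
```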

`result1` and `result2` hold the labels predicted for the tweets in the test corpus. We only need to write them to .csv files in the format of the sample submission.

In [None]:
# Convert the float predictions (-1.0 / 1.0) to integers for the prediction .csv
result1 = [int(x) for x in result1]
result2 = [int(x) for x in result2]

In [None]:
def create_csv(df, filename):
    # df holds the predictions in column 0, one row per test tweet
    # Creating the correctly named columns for the .csv (Ids start at 1)
    df['Id'] = df.index + 1
    df['Prediction'] = df[0]
    df = df[['Id', 'Prediction']]
    # Saving prediction to .csv in the outputs/ folder
    df.to_csv('outputs/' + filename, index=False)
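
To check the resulting layout without touching the disk, the sketch below builds the same two columns for a few made-up predictions and renders them as CSV text. The names `toy_predictions`, `submission` and `csv_text` are hypothetical, and this is an illustration of the file format rather than a replacement for `create_csv`.

```python
import pandas as pd

# Hypothetical integer predictions for three test tweets
toy_predictions = [-1, 1, 1]

# Ids start at 1 and follow the order of the test set
submission = pd.DataFrame({
    "Id": range(1, len(toy_predictions) + 1),
    "Prediction": toy_predictions,
})

# With no path given, to_csv returns the CSV content as a string
csv_text = submission.to_csv(index=False)
print(csv_text)
```

The output has a header row `Id,Prediction` followed by one `Id,label` line per tweet.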

In [None]:
# Creating a DataFrame for easy .csv transformation
svm_df = pd.DataFrame(result1)
create_csv(svm_df, 'svm.csv')

In [None]:
# Creating a DataFrame for easy .csv transformation
bayes_df = pd.DataFrame(result2)
create_csv(bayes_df, 'bayes.csv')