# **TF-IDF** (+ Linear SVC classification) - Predictions
This notebook loads the fitted TF-IDF vectorizer and the trained linear support vector classifier,
then uses them to regenerate the submitted predictions.


## Sources
This notebook uses the scikit-learn library for both steps: TfidfVectorizer (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) produces the TF-IDF representation, and LinearSVC (https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html) performs the classification.
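
For context, the two pickled objects loaded further down could have been produced along the following lines. This is a minimal sketch on a made-up four-sentence corpus, not the training code behind the submission: the names `vect_demo` and `clf_demo`, the toy texts and labels, and the default hyperparameters are all assumptions. It also checks that a pickle round trip leaves the predictions unchanged.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy corpus, made up for illustration: 1 = positive, -1 = negative
train_texts = [
    "i love this so much",
    "what a great happy day",
    "i hate this so much",
    "what a sad awful day",
]
train_labels = [1, 1, -1, -1]

# Fit the vectorizer and the classifier (hyperparameters are assumptions)
vect_demo = TfidfVectorizer()
X_train = vect_demo.fit_transform(train_texts)
clf_demo = LinearSVC(random_state=0)
clf_demo.fit(X_train, train_labels)

# Serialize both fitted objects to bytes, then restore them
vect_restored = pickle.loads(pickle.dumps(vect_demo))
clf_restored = pickle.loads(pickle.dumps(clf_demo))

# The restored pair should predict exactly what the original pair predicts
new_texts = ["such a happy day", "i hate this awful day"]
before = clf_demo.predict(vect_demo.transform(new_texts))
after = clf_restored.predict(vect_restored.transform(new_texts))
print(list(before) == list(after))
```

The important detail is that new text is passed through `transform`, never `fit_transform`, so the vocabulary and IDF weights stay those learned at training time.
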
## Reproducibility
Running this notebook regenerates the predictions of Submission **#109900** on AIcrowd, which scored:

| Accuracy | F1 |
|:---:|:---:|
| 86.2% | 86.5% |
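
As a reminder of what the two reported metrics measure, here is a small worked example in plain Python. The labels are made up, and treating `1` as the positive class for a binary F1 is an assumption; the source does not state how AIcrowd computes or averages the score.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, binary F1) for two equally long label sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    accuracy = correct / len(pairs)
    denom = 2 * tp + fp + fn
    # F1 is the harmonic mean of precision and recall: 2TP / (2TP + FP + FN)
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Made-up labels: 4 of 6 correct, with TP = 2, FP = 1, FN = 1
y_true = [1, 1, 1, -1, -1, -1]
y_pred = [1, 1, -1, 1, -1, -1]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.3f}, f1={f1:.3f}")
```
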

### Import modules and download dataset

In [1]:
import os 
import wget
import tfidf_models as mod
import pickle
root = 'data/'

os.makedirs(root, exist_ok=True)

seed = 0

# Prepare test set
test_url = 'https://api.onedrive.com/v1.0/shares/u!aHR0cHM6Ly8xZHJ2Lm1zL3QvcyFBclREZ3U5ejdJT1ZqcDR5Q3hoWXM4T2FJd1JLenc_ZT1hSXh0/root/content'
test_filename = root + 'test.txt'
wget.download(test_url, test_filename)

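# Each line has the form "<index>,<tweet>"; keep only the tweet text.
# Cut at the first comma only, because the tweet itself may contain commas.
test_tweets = []
with open(test_filename, encoding='utf-8') as f:
    for line in f:
        _index, _sep, tweet = line.partition(',')
        test_tweets.append(tweet)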
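
The parsing rule used in the loop above (drop the leading index, keep everything after the first comma) can be checked in isolation. The sample lines below are made up; the sketch only shows that `str.partition(',')` and the split-and-rejoin approach return the same tweet text, including any commas and the trailing newline.

```python
# Made-up lines in the "<index>,<tweet>" layout the loop expects
sample_lines = [
    "1,just a plain tweet\n",
    "2,hello, world, again\n",
]

# Approach A: split on every comma, then rejoin everything after the index
via_split = [",".join(line.split(",")[1:]) for line in sample_lines]

# Approach B: cut once at the first comma
via_partition = [line.partition(",")[2] for line in sample_lines]

print(via_split == via_partition)
print(repr(via_partition[1]))
```
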


In [2]:
# Load the fitted vectorizer.
# Note: this notebook does not download the .pkl files; they must already be in `root`.
vectorizer_filename = root + 'tf-idf_fitted_vectorizer.pkl'
with open(vectorizer_filename, 'rb') as file:
    vect = pickle.load(file)

# Load the trained classifier
clf_filename = root + 'tf-idf_trained_linearSVC.pkl'
with open(clf_filename, 'rb') as file:
    clf = pickle.load(file)

In [3]:
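# Vectorize the test tweets with the already-fitted vectorizer (transform only, no refitting)
X_test = vect.transform(test_tweets)

# Predict a label for every test tweet and write the submission file
save_filename = 'submission_tfidf_predictions.csv'
predictions = clf.predict(X_test)
mod.save_pred(save_filename, predictions)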
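
`save_pred` comes from the local `tfidf_models` module, which is not shown here. The following is only a guess at what such a helper might look like: the function name `save_pred_sketch`, the `Id,Prediction` header, and the 1-based ids are assumptions, not the actual implementation or the format AIcrowd requires.

```python
import csv
import os
import tempfile


def save_pred_sketch(filename, predictions, first_id=1):
    """Hypothetical stand-in for `mod.save_pred`: one CSV row per prediction."""
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Id", "Prediction"])  # assumed column names
        for i, label in enumerate(predictions, start=first_id):
            writer.writerow([i, int(label)])  # int() also handles NumPy integers


# Usage with made-up predictions, written to a temporary directory
demo_path = os.path.join(tempfile.mkdtemp(), "demo_submission.csv")
save_pred_sketch(demo_path, [1, -1, 1])
with open(demo_path, encoding="utf-8") as f:
    demo_rows = f.read().splitlines()
print(demo_rows)
```
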