## Practice - Twitter classifier
With the tweet corpus from two Twitter accounts (archives from Ariana Grande and Trump):

2) 
3) Set up and fit a linear model, then predict which account an input tweet is from, along with the probability of that prediction

In [35]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
import json

### Helper functions
You may find these useful in the lab. Feel free to modify them for your needs.

In [36]:
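def print_examples(data, probs, label1, label2, n=10):
    """Print the first n texts with the predicted probability of each label.

    Each row of probs is read as (probability of label1, probability of label2).
    """
    def percent(x):
        return "{}%".format(round(x * 100, 1))

    for text, pred in list(zip(data, probs))[:n]:
        print("{}\n{}: {} / {}: {}\n{}".format(
            text,
            label1,
            percent(pred[0]),
            label2,
            percent(pred[1]),
            "-" * 50  # separator line
        ))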
        
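def predict(model, vectorizer, data, all_predictions=False):
    """Vectorize raw texts, then return class probabilities or predicted labels."""
    features = vectorizer.transform(data)
    if all_predictions:
        # one row per text, one column per class (ordered as in model.classes_)
        return model.predict_proba(features)
    return model.predict(features)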
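
The two modes of `predict` are related: `predict_proba` returns one row per text with one column per class, in sorted label order, and `predict` reports the most probable class. A minimal pure-Python sketch of that relationship, where `labels_from_probabilities` and the probability rows are hypothetical illustrations and not part of scikit-learn:

```python
def labels_from_probabilities(prob_rows, classes=(0, 1)):
    """Return, for each row, the class whose column holds the largest probability."""
    labels = []
    for row in prob_rows:
        best_column = max(range(len(row)), key=lambda i: row[i])
        labels.append(classes[best_column])
    return labels


# Illustrative rows, not taken from a real model.
example_rows = [[0.32, 0.68], [0.81, 0.19]]
example_labels = labels_from_probabilities(example_rows)
print(example_labels)  # [1, 0]
```

Because the column order follows the class labels, the label names passed to `print_examples` need to be given in that same order.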

### Cleaning function
Create a simple text-cleaning function. Keep it light, because tweets are sensitive to major reformatting. You may experiment with this function!

In [37]:
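# Build these once instead of on every call (needs the NLTK "stopwords" corpus).
STOP_WORDS = set(stopwords.words("english"))
TOKENIZER = TweetTokenizer()

def twitter_text_clean(text):
    # Lowercase, tokenize with a tweet-aware tokenizer, then drop stop words.
    tokens = TOKENIZER.tokenize(str(text).lower())
    return " ".join(w for w in tokens if w not in STOP_WORDS)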
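
To see why a tweet-aware tokenizer matters, here is a rough standard-library sketch that compares naive word splitting with a pattern that keeps mentions, hashtags and URLs whole. The regular expression, the tiny stop-word set and the sample tweet are all assumptions made for illustration. This is a simplified stand-in and not how NLTK's `TweetTokenizer` is implemented.

```python
import re

# Simplified stand-in for a tweet-aware tokenizer (NOT NLTK's implementation).
# Order matters: URLs first, then @mentions/#hashtags, then words, then punctuation.
TOKEN_RE = re.compile(r"https?://\S+|[@#]\w+|\w+(?:'\w+)?|[^\w\s]")

# Tiny hypothetical stop-word list, only for this example.
TOY_STOP_WORDS = {"for", "the", "a", "is", "to"}


def simple_tweet_clean(text):
    """Lowercase, tokenize while keeping tweet entities intact, drop stop words."""
    tokens = TOKEN_RE.findall(text.lower())
    return " ".join(t for t in tokens if t not in TOY_STOP_WORDS)


sample = "Thank u @fan for the love!! #blessed https://t.co/abc"
cleaned = simple_tweet_clean(sample)
naive = re.findall(r"\w+", sample.lower())

print(cleaned)  # thank u @fan love ! ! #blessed https://t.co/abc
print(naive)    # the mention, hashtag and URL are broken into plain words
```

Naive splitting turns `@fan` into `fan` and shatters the URL into `https`, `t`, `co`, `abc`, which discards signals that can help tell the two accounts apart.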

### Fetch data from /twitter_data

In [38]:
def tweets(name, test_size=0.1):
    """Load twitter_data/<name>.json, clean each tweet, and return a (train, test) split."""
    with open("twitter_data/{}.json".format(name)) as f:
        texts = [t.get("text") for t in json.load(f)]
    cleaned = [twitter_text_clean(t) for t in texts]
    # No random_state is set here, so this split differs on every run.
    return train_test_split(cleaned, test_size=test_size)

In [39]:
# First, gather a train and test set from each tweet file.
# Below, the pooled train sets are split again to create a second test set
# for evaluating the model, so the two per-account test sets
# (ariana_test, trump_test) stay completely unseen by the model we train.
# TODO: experiment with this parameter!
test_split = 0.2
ariana_train, ariana_test = tweets("ariana", test_size=test_split)
trump_train, trump_test = tweets("trump", test_size=test_split)

# labels: 1 = Ariana, 0 = Trump
y = [1]*len(ariana_train) + [0]*len(trump_train)
x = ariana_train + trump_train
print("Train samples: {}".format(len(x)))

X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.1, random_state=4310)

Train samples: 320
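
The two-stage split can be checked with a little arithmetic. The sketch below assumes 200 tweets per account, a number that is consistent with the 320 printed above but is not stated anywhere in the notebook. `split_sizes` is a hypothetical helper. It follows the rounding `train_test_split` uses when only a fractional `test_size` is given: the test share is rounded up and the rest goes to training.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test share up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


tweets_per_account = 200  # ASSUMPTION: not given in the notebook

# Stage 1: hold out 20% of each account as a fully unseen test set.
held_train, held_test = split_sizes(tweets_per_account, 0.2)

# Stage 2: pool both accounts' training tweets, then hold out 10% for evaluation.
pooled = 2 * held_train
fit_n, val_n = split_sizes(pooled, 0.1)

print(held_train, held_test)  # 160 40
print(pooled)                 # 320
print(fit_n, val_n)           # 288 32
```

Under that assumption, the model is fit on 288 tweets and evaluated on 32, and 40 tweets per account remain untouched for the probability printouts.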


### TF-IDF + logistic regression
- Vectorize the tweets (e.g. with CountVectorizer or TfidfVectorizer).
- Logistic regression suits this task because it is a binary classifier: it estimates the probability that a tweet belongs to class 1 and predicts 1 when that probability is above 0.5, otherwise 0.
    - This is exactly what we want in this case: deciding between two tweet sources (1 or 0).
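
The two steps above can be sketched in plain Python. The sketch uses the weighting that `TfidfVectorizer` applies by default (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2-normalised rows) and the logistic function that turns a linear score into a probability. The toy documents, the coefficients and the helper names are hypothetical. A fitted model learns its own coefficients, and the real vectorizer also handles tokenization, which is skipped here.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """TF-IDF for pre-tokenized docs: raw counts, smoothed IDF, L2-normalised rows."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for doc in docs:
        weights = {t: tf * idf[t] for t, tf in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


def sigmoid(z):
    """Map a linear score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Toy, already-tokenized tweets (illustrative only).
docs = [["love", "u", "album"], ["stimulus", "republicans", "u"]]
rows = tfidf_rows(docs)

# HYPOTHETICAL coefficients: positive pushes towards class 1, negative towards class 0.
coef = {"love": 1.5, "album": 1.0, "stimulus": -1.5, "republicans": -1.0}
intercept = 0.0

probabilities = []
for row in rows:
    score = intercept + sum(coef.get(t, 0.0) * w for t, w in row.items())
    probabilities.append(sigmoid(score))

predicted = [1 if p >= 0.5 else 0 for p in probabilities]
print(predicted)  # [1, 0]
```

The token `u` occurs in both toy documents, so it receives a lower weight than terms unique to one document. This is how TF-IDF downplays words that do not help separate the two accounts.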

In [40]:
vectorizer = TfidfVectorizer()
# define regression and fit
LR = LogisticRegression()
LR.fit(vectorizer.fit_transform(X_train), y_train)

# evaluate by confusion matrix 
y_pred = predict(LR, vectorizer, X_test)
print(confusion_matrix(y_test, y_pred))

[[16  0]
 [ 0 16]]
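
To read this matrix: `confusion_matrix` puts true labels on the rows and predicted labels on the columns, both in sorted label order, so here row and column 0 are Trump and row and column 1 are Ariana. A short sketch, using the matrix printed above, that derives accuracy and per-class precision and recall (the variable names are illustrative):

```python
# Values copied from the output above; rows = true label, columns = predicted label.
matrix = [[16, 0],
          [0, 16]]
labels = ["Trump", "Ariana"]  # 0 = Trump, 1 = Ariana

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(matrix)))
accuracy = correct / total

recall = {}
precision = {}
for i, name in enumerate(labels):
    recall[name] = matrix[i][i] / sum(matrix[i])                    # share of true tweets found
    precision[name] = matrix[i][i] / sum(row[i] for row in matrix)  # share of predictions correct

print(accuracy)   # 1.0
print(recall)     # {'Trump': 1.0, 'Ariana': 1.0}
print(precision)  # {'Trump': 1.0, 'Ariana': 1.0}
```

All 32 evaluation tweets fall on the diagonal. With a set this small, a perfect score is encouraging but not conclusive.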


In [15]:
ariana_prob = predict(LR, vectorizer, ariana_test, all_predictions=True)
print_examples(ariana_test, ariana_prob, "Trump", "Ariana")

@kioshiwarrior u angel omg video made happy . thank u every minute . ’ watchi … https://t.co/aaqtky0fst
Trump: 32.0% / Ariana: 68.0%
--------------------------------------------------
@imhvogue yeeeee thank u ! ! !
Trump: 36.9% / Ariana: 63.1%
--------------------------------------------------
love u much ’ start pls https://t.co/syh5atqhzw
Trump: 15.6% / Ariana: 84.4%
--------------------------------------------------
’ wait give u album month
Trump: 41.8% / Ariana: 58.2%
--------------------------------------------------
https://t.co/nob4qnhpkx
Trump: 4.0% / Ariana: 96.0%
--------------------------------------------------
rt @teamariana : r . e . . full fragrance commercial 🤍 watch : https://t.co/dvwaqrsilm https://t.co/ffk0blmgnq
Trump: 21.8% / Ariana: 78.2%
--------------------------------------------------
love u thankfulll
Trump: 18.9% / Ariana: 81.1%
--------------------------------------------------
congratulations incredible deserving team @tbhits @amnija_ @londonondatrack #po

In [16]:
trump_prob = predict(LR, vectorizer, trump_test, all_predictions=True)
print_examples(trump_test, trump_prob, "Trump", "Ariana")

rt @realdonaldtrump : spoke prime minister @borisjohnson united kingdom . thankful friendship support …
Trump: 77.1% / Ariana: 22.9%
--------------------------------------------------
rt @realdonaldtrump : pelosi holding stimulus , republicans !
Trump: 81.2% / Ariana: 18.8%
--------------------------------------------------
rt @realdonaldtrump : morocco recognized united states 1777 . thus fitting recognize sovereignty western saha …
Trump: 82.9% / Ariana: 17.1%
--------------------------------------------------
rt @whitehouse : president @realdonaldtrump wheels minnesota ! https://t.co/h3hgy0skfc
Trump: 90.8% / Ariana: 9.2%
--------------------------------------------------
rt @realdonaldtrump : https://t.co/wheb2u37mi
Trump: 81.0% / Ariana: 19.0%
--------------------------------------------------
rt @realdonaldtrump : https://t.co/cglqrmhtv4
Trump: 81.0% / Ariana: 19.0%
--------------------------------------------------
rt @whitehouse : " amy coney barrett decide cases based text con