# Fit a Model and Predict Favoriting
This notebook attempts to create a model that can predict whether a tweet was favorited by other Twitter users.

In [23]:
import pandas as pd
import mlutils
from sklearn import svm, metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import json
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from tqdm import tqdm


In [24]:
#set random seed
RANDOM_SEED = 655

## Read Cleaned Tweet Dataset

In [25]:
#read file
df = pd.read_json("./intermediate_data/cleaned_tweet_data.json")

In [26]:
df.head()

Unnamed: 0,user_created_at,tweet_full_text,tweet_favorite_count,tweet_created_at,user_name,user_profile_image_url_https,user_profile_sidebar_border_color,user_profile_sidebar_fill_color,user_profile_text_color,user_profile_use_background_image,user_screen_name,user_profile_background_color,user_friends_count,user_followers_count,user_description,user_location,user_location_State,state_political_values,tweet_w_user_descript,tweet_favorited
0,2015-05-08 10:27:51+00:00,done is better than perfect. — sheryl sandberg...,0,2018-09-07 16:25:06+00:00,Ultra YOU Woman,https://pbs.twimg.com/profile_images/597000926...,C0DEED,DDEEF6,333333,True,UltraYOUwoman,C0DEED,48721,57983,i share tips to achieve your health goals and ...,"California, USA",CA,Democrat,done is better than perfect. — sheryl sandberg...,False
1,2008-12-26 09:30:23+00:00,hero fdny likesforlikes promo music instagood ...,0,2018-09-07 16:24:59+00:00,Yung Cut Up (Videos),https://pbs.twimg.com/profile_images/945333114...,FFFFFF,EFEFEF,333333,True,yungcutup,131516,5489,13241,all business inquiries contact cluuxxgmail.com...,"Miami, Florida",FL,Republican,hero fdny likesforlikes promo music instagood ...,False
2,2009-04-17 23:04:15+00:00,just do it 4your morning 4your meme cookie f...,0,2018-09-07 16:24:50+00:00,Rachel Bogle,https://pbs.twimg.com/profile_images/986345956...,FFFFFF,FC6A71,50505,True,rachelbogle,FFFAFF,2386,11377,morning traffic reporter cbs 4indy | traffic a...,"Indianapolis, IN",IN,Republican,just do it 4your morning 4your meme cookie f...,False
3,2010-08-08 02:02:56+00:00,kapernickeffect swoosh justdoit lucas bishop'...,0,2018-09-07 16:24:44+00:00,Ervin Youngblood,https://pbs.twimg.com/profile_images/724407937...,C0DEED,DDEEF6,333333,True,ErvGotti609,C0DEED,965,218,"giants, mets, 7 6ers, penguins, florida state,...",Tennessee by way of New Jersey,TN,Republican,kapernickeffect swoosh justdoit lucas bishop'...,False
5,2008-07-23 16:43:42+00:00,real donald trump it's time for me to stock up...,0,2018-09-07 16:24:35+00:00,tazman69,https://pbs.twimg.com/profile_images/743752426...,C0DEED,DDEEF6,333333,True,tazman69,C0DEED,175,64,"enjoys cycling, running & spending a relaxing ...","Austin, TX",TX,Republican,real donald trump it's time for me to stock up...,False


In [27]:
df_nike = df[df['tweet_full_text'].str.contains("nike")]

## Split Dataset into Training, Dev, and Test Sets

In [28]:
# Shuffle the rows, then cut at 80% and 90% to get train / dev / test (80/10/10)
train_df, dev_df, test_df = np.split(
    df_nike.sample(frac=1, random_state=RANDOM_SEED),
    [int(.8*len(df_nike)), int(.9*len(df_nike))]
)
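The 80/10/10 split above could also be sketched with `train_test_split`, which the notebook imports but does not call. The version below is a minimal sketch on a hypothetical stand-in DataFrame (`toy_df` is not part of the notebook), and it adds stratification on the label, which the original split does not do. It will not assign the same rows to each set as the `np.split` call.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

RANDOM_SEED = 655  # same seed value as the notebook

# Hypothetical stand-in for df_nike: 100 rows with a boolean label.
toy_df = pd.DataFrame({
    "tweet_full_text": [f"nike tweet {i}" for i in range(100)],
    "tweet_favorited": [i % 3 == 0 for i in range(100)],
})

# First hold out 20% of the rows, keeping the label balance similar.
train_df, holdout_df = train_test_split(
    toy_df, test_size=0.2, random_state=RANDOM_SEED,
    stratify=toy_df["tweet_favorited"],
)

# Then halve the held-out rows into a dev set and a test set.
dev_df, test_df = train_test_split(
    holdout_df, test_size=0.5, random_state=RANDOM_SEED,
    stratify=holdout_df["tweet_favorited"],
)

print(len(train_df), len(dev_df), len(test_df))
```

Stratifying is a design choice rather than a requirement; it mainly helps when one class is rare and a small dev set could otherwise end up with very few positive examples.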

## Convert Text Data to Features Using a TF-IDF Unigram and Bigram Vectorizer

In [29]:
bigram_vectorizer = TfidfVectorizer(stop_words='english', min_df=500, ngram_range=(1,2))
X_train = bigram_vectorizer.fit_transform(train_df.tweet_full_text)

y_train = list(train_df.tweet_favorited)

In [30]:
X_train

<796x2 sparse matrix of type '<class 'numpy.float64'>'
	with 1337 stored elements in Compressed Sparse Row format>
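The matrix above has only two columns because `min_df=500` keeps a term only if it appears in at least 500 training tweets. The sketch below illustrates the same effect on a hypothetical four-document corpus (the documents are invented for illustration, and stop-word removal is left out to keep the vocabulary easy to follow).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus; the notebook's tweets are not reproduced here.
docs = [
    "nike just do it",
    "nike shoes sale",
    "nike just dropped new shoes",
    "boycott nike now",
]

vocab_sizes = {}
for min_df in (1, 2, 4):
    # Keep unigrams and bigrams that occur in at least `min_df` documents.
    vec = TfidfVectorizer(min_df=min_df, ngram_range=(1, 2))
    X = vec.fit_transform(docs)
    vocab_sizes[min_df] = X.shape[1]
    print(min_df, X.shape, sorted(vec.vocabulary_))
```

As the threshold rises, the vocabulary shrinks until only terms shared by nearly every document remain, which leaves a downstream classifier with very little to learn from.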

## Train a Random Forest Classifier

In [31]:
clf = RandomForestClassifier(max_depth=5, random_state=RANDOM_SEED).fit(X_train, y_train)

## Generate Dev Data

In [32]:
X_dev = bigram_vectorizer.transform(dev_df.tweet_full_text)

y_dev = list(dev_df.tweet_favorited)

## Create Dummy Classifiers
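Dummy classifiers provide baselines: they show what score the task would yield if predictions followed a simple rule that ignores the input features. For example, the "most frequent" strategy always predicts the most frequent y value in the training set, and the "uniform" strategy picks each class with equal probability.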

In [33]:
dummy_clf_most_frequent = DummyClassifier(strategy="most_frequent", random_state=RANDOM_SEED)
dummy_clf_most_frequent.fit(X_train, y_train)

dummy_clf_uniform = DummyClassifier(strategy="uniform", random_state=RANDOM_SEED)
dummy_clf_uniform.fit(X_train, y_train)


DummyClassifier(random_state=655, strategy='uniform')
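To make the two baseline strategies concrete, here is a minimal sketch on hypothetical labels (`X_toy` and `y_toy` are invented for illustration). Dummy classifiers ignore the feature values, so a placeholder feature matrix is enough.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical imbalanced labels: 7 False, 3 True.
X_toy = np.zeros((10, 1))  # placeholder features; dummy classifiers ignore them
y_toy = np.array([False] * 7 + [True] * 3)

most_frequent = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
uniform = DummyClassifier(strategy="uniform", random_state=655).fit(X_toy, y_toy)

mf_preds = most_frequent.predict(X_toy)   # always the majority class (False here)
uniform_preds = uniform.predict(X_toy)    # each class drawn with equal probability

print(mf_preds)
print(uniform_preds)
```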

## Create Predictions for Dev Set

In [34]:
# clf is the random forest trained above; the lr_ prefix is only a variable name
lr_dev_preds = clf.predict(X_dev)
rand_dev_preds = dummy_clf_uniform.predict(X_dev)
mf_dev_preds = dummy_clf_most_frequent.predict(X_dev)

## Score Predictions for Dev Set

In [35]:
lr_f1 = f1_score(y_dev, lr_dev_preds, average='macro')
rand_f1 = f1_score(y_dev, rand_dev_preds, average='macro')
mf_f1 = f1_score(y_dev, mf_dev_preds, average='macro')

In [36]:
print("Model Score:", lr_f1)
print("Dummy Score (random):", rand_f1)
print("Dummy Score (most frequent):", mf_f1)

Model Score: 0.4309460929004194
Dummy Score (random): 0.38636363636363635
Dummy Score (most frequent): 0.3888888888888889
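Macro F1 averages the per-class F1 scores with equal weight, so a classifier that never predicts the minority class scores 0 on that class and can reach at most half of its majority-class F1. The sketch below works through that arithmetic on synthetic labels; the 63/36 class split is an assumption chosen for illustration, since the notebook does not print the dev set's class counts.

```python
import numpy as np
from sklearn.metrics import f1_score

# Synthetic labels (an assumption for illustration): 63 False, 36 True.
y_true = np.array([False] * 63 + [True] * 36)
y_pred = np.zeros_like(y_true)  # a majority-class predictor: always False

# Per-class F1, ordered as [False, True]; zero_division=0 scores the
# never-predicted True class as 0 without raising a warning.
per_class = f1_score(y_true, y_pred, average=None, zero_division=0)

# Macro F1 is the unweighted mean of the per-class scores.
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(per_class, macro)
```

With these synthetic labels the False class has precision 63/99 and recall 1, giving an F1 of about 0.778, and the macro average is about 0.389. That equals the most-frequent baseline score above, which is consistent with a similar class balance in the dev set but does not establish it.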


## Results
The random forest model has a macro F1 score (0.431) higher than that of either dummy classifier (0.386 for random, 0.389 for most frequent). This suggests that a tweet's text may carry some predictive power for whether it will be favorited, although the margin is small and the vectorizer kept only two features, so the result should be read cautiously. Nike may benefit from looking more closely at favorited tweets to see what they have in common. An interesting next step would be to compare these results with the insights from Section 2, the exploratory analysis: What are their topics? How do the people sending the tweets identify themselves, and where are they from?

## Next Step
After you fit and score the model, run the next step in the workflow, [4-FitPredictECResults.ipynb](./4-FitPredictECResults.ipynb), or go back to [0-Workflow.ipynb](./0-Workflow.ipynb).

---

**Author:** [Nick Capaldini](mailto:nickcaps@umich.edu), University of Michigan, January 19, 2022

---