# Twitter Sentiment Analysis - POC


## Project Structure


## 1. Project Definition

For both structure and definition, see the sentiment140 series.

## 2. Preliminary EDA

In [1]:
import os
import time
import numpy as np
import pandas as pd 

In [2]:
def load_deduped_data():
    """Loads most recent deduped version.
    """
    dirpath = os.path.join("..","data","1.2_deduped","tweets") 
    filename = sorted(os.listdir(dirpath), reverse=True)[0]
    filepath = os.path.join(dirpath, filename)
    df = pd.read_csv(filepath, usecols=[0,2,3,4])
    return df
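
One thing to watch: `sorted(os.listdir(dirpath), reverse=True)[0]` picks the lexicographically last entry. That is the most recent file only if the filenames start with a sortable timestamp. `os.listdir` also returns subdirectories, such as the `prepared` folder created later in this notebook. Below is a minimal sketch of a more defensive lookup, assuming CSV files with sortable names; the helper name `latest_csv` and the example filenames are hypothetical.

```python
import os
import tempfile


def latest_csv(dirpath):
    """Return the path of the lexicographically last .csv file in dirpath.

    Assumes filenames start with a sortable timestamp (hypothetical naming).
    """
    candidates = [
        name for name in os.listdir(dirpath)
        if name.endswith(".csv") and os.path.isfile(os.path.join(dirpath, name))
    ]
    if not candidates:
        raise FileNotFoundError(f"no CSV files found in {dirpath}")
    return os.path.join(dirpath, max(candidates))


# Demo on a throwaway directory that mimics the layout
demo_dir = tempfile.mkdtemp()
for name in ("2020-09-05.csv", "2020-10-07.csv"):
    open(os.path.join(demo_dir, name), "w").close()
os.mkdir(os.path.join(demo_dir, "prepared"))  # subdirectory must be ignored

latest = latest_csv(demo_dir)
```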

In [3]:
raw = load_deduped_data()

### Preview data & transform to match sentiment140

In [4]:
raw.head()

| | ID | User | Text | Polarity |
|---|---|---|---|---|
| 0 | 1302406168791470081 | asarinanamis | RT @Ayshiun: Totally inspired by@/kianamaiart'... | -1 |
| 1 | 1302406168766078976 | xleave_thecity | @thiinkinaboutit thank you 🥺 | -1 |
| 2 | 1302406168170696706 | MsTam_Tam | some of yall retweets really be having me look... | -1 |
| 3 | 1302406168120365056 | _thebdawkk | @jeenbeen__ Thank you Jen 🥺 just miss the old ... | -1 |
| 4 | 1302406167956602880 | 4ranghae1015 | @SJofficial My favorite part is the BAD boy, g... | -1 |


In [5]:
raw.loc[(raw.Polarity == -1),'Polarity'] = 0

raw = raw.rename(columns={"User":"username","Text":"tweet","Polarity":"target"})

raw.head()

| | ID | username | tweet | target |
|---|---|---|---|---|
| 0 | 1302406168791470081 | asarinanamis | RT @Ayshiun: Totally inspired by@/kianamaiart'... | 0 |
| 1 | 1302406168766078976 | xleave_thecity | @thiinkinaboutit thank you 🥺 | 0 |
| 2 | 1302406168170696706 | MsTam_Tam | some of yall retweets really be having me look... | 0 |
| 3 | 1302406168120365056 | _thebdawkk | @jeenbeen__ Thank you Jen 🥺 just miss the old ... | 0 |
| 4 | 1302406167956602880 | 4ranghae1015 | @SJofficial My favorite part is the BAD boy, g... | 0 |
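
The relabeling and renaming step can be checked on a toy frame. This is a minimal sketch, assuming the raw labels only take the values -1 and 1 (consistent with the two target classes counted in this notebook); the example rows are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "User": ["a", "b", "c"],
    "Text": ["sad", "happy", "meh"],
    "Polarity": [-1, 1, -1],
})

# Map negative labels from -1 to 0, leave 1 untouched
toy["Polarity"] = toy["Polarity"].replace({-1: 0})
toy = toy.rename(columns={"User": "username", "Text": "tweet", "Polarity": "target"})

# Fail fast if any label other than 0 or 1 remains
unexpected = set(toy["target"].unique()) - {0, 1}
assert not unexpected, f"unexpected labels: {unexpected}"
```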


### Target distribution

In [6]:
raw[["target","ID"]].groupby("target").count()

| target | ID |
|---|---|
| 0 | 488341 |
| 1 | 550509 |
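
Expressed as proportions, these counts are roughly 47% for class 0 and 53% for class 1, so the classes are only mildly imbalanced. A small sketch that recomputes the shares from the counts reported above; on the full frame, `raw["target"].value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Counts copied from the groupby output above
counts = pd.Series({0: 488341, 1: 550509}, name="n_tweets")

# Share of each class, rounded to three decimals
shares = (counts / counts.sum()).round(3)
print(shares)
```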


### Missing values

In [7]:
raw.isnull().sum()

ID          0
username    0
tweet       0
target      0
dtype: int64
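
Note that `isnull()` only flags `NaN`/`None`; empty or whitespace-only tweets would still pass this check. Below is a minimal sketch of an additional check on made-up rows. Whether such rows exist in the real data is not something the output above tells us.

```python
import pandas as pd

toy = pd.DataFrame({"tweet": ["hello", "", "   ", None]})

# Missing values, as counted by isnull()
n_null = toy["tweet"].isnull().sum()

# Non-missing tweets that are empty once surrounding whitespace is stripped
n_blank = toy["tweet"].dropna().str.strip().eq("").sum()
```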

### Split X, y

Here I just split my data into a feature matrix X and its target vector y:

In [8]:
X = raw.iloc[:,:3]
y = raw.iloc[:,3:4]
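
Positional slicing with `iloc` depends on column order. Here is a sketch of the same split by column name, on a toy frame that reuses this notebook's column names. Note that `iloc[:, 3:4]` keeps `y` as a one-column DataFrame, whereas `raw["target"]` would give a Series.

```python
import pandas as pd

toy = pd.DataFrame({
    "ID": [1, 2],
    "username": ["a", "b"],
    "tweet": ["hi", "bye"],
    "target": [1, 0],
})

X_named = toy[["ID", "username", "tweet"]]
y_named = toy[["target"]]  # one-column DataFrame, like iloc[:, 3:4]

# Same result as the positional version
assert X_named.equals(toy.iloc[:, :3])
assert y_named.equals(toy.iloc[:, 3:4])
```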

---

## 3. Split into Training & Test Sets

In [9]:
import numpy as np
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

### Set Test set aside

Save the training and test sets, along with their indices (although the split is arguably reproducible by reusing the random seed above).
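
The seed only reproduces the split if the input frame has the same rows in the same order, which is one reason saving the indices is a useful safeguard. Below is a minimal sketch on toy data (the frames `toy_X` and `toy_y` are made up) showing that two calls with the same `random_state` yield identical indices.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

toy_X = pd.DataFrame({"tweet": [f"tweet {i}" for i in range(20)]})
toy_y = pd.DataFrame({"target": np.arange(20) % 2})

# Each call returns [X_train, X_test, y_train, y_test]
split_a = train_test_split(toy_X, toy_y, test_size=0.25, random_state=42)
split_b = train_test_split(toy_X, toy_y, test_size=0.25, random_state=42)

# Same seed and same input order give identical train/test indices
same_train = split_a[0].index.equals(split_b[0].index)
same_test = split_a[1].index.equals(split_b[1].index)
n_train, n_test = len(split_a[0]), len(split_a[1])
```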

In [10]:
X_train.head()

| | ID | username | tweet |
|---|---|---|---|
| 114966 | 1305664535454511105 | vooshamii | @teendtreythread Hell yes 😩💕 |
| 808955 | 1313654371972857856 | NfcSujith | RT @NivinTwiter: 5 days to go for @NivinOffici... |
| 704720 | 1312628140229849088 | Rimjhim18331641 | @iamheneral Me😊😊😊😊😉\n\nAlways there😋😊 |
| 804340 | 1313630256750714880 | _VANEdoesit | “Vas a venir pa diciembre” messages are starti... |
| 549083 | 1310310660463624198 | I_amTrey | Me watching the @AtlantaFalcons game right now... |


In [11]:
y_train.head()

| | target |
|---|---|
| 114966 | 0 |
| 808955 | 1 |
| 704720 | 1 |
| 804340 | 0 |
| 549083 | 0 |


In [15]:
# save train and test indices
deduped_path = os.path.join("..","data","1.2_deduped","tweets","prepared") 

# create the output directory if it does not exist yet
os.makedirs(deduped_path, exist_ok=True)
    
trainpath = os.path.join(deduped_path, "train_ix.npy")
testpath = os.path.join(deduped_path, "test_ix.npy")

with open(trainpath, 'wb') as f:
    np.save(f, np.array(y_train.index))
    
with open(testpath, 'wb') as f:
    np.save(f, np.array(y_test.index))
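
The saved indices can later be used to recover the same subsets from the full frame with label-based selection. This is a minimal sketch with a temporary directory and made-up indices; the real paths would be `trainpath` and `testpath` from the cell above.

```python
import os
import tempfile

import numpy as np
import pandas as pd

toy = pd.DataFrame({"tweet": list("abcdef"), "target": [0, 1, 0, 1, 0, 1]})
train_ix = np.array([4, 0, 3, 5])  # made-up split indices
test_ix = np.array([1, 2])

tmp_dir = tempfile.mkdtemp()
np.save(os.path.join(tmp_dir, "train_ix.npy"), train_ix)
np.save(os.path.join(tmp_dir, "test_ix.npy"), test_ix)

# Reload and select rows by label to recover the same subsets
loaded_train = np.load(os.path.join(tmp_dir, "train_ix.npy"))
loaded_test = np.load(os.path.join(tmp_dir, "test_ix.npy"))
train_again = toy.loc[loaded_train]
test_again = toy.loc[loaded_test]
```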

In [16]:
# Save train and test sets
# Note: index=False drops the original row labels; the split indices are saved separately above (train_ix.npy, test_ix.npy)
X_train.to_csv(os.path.join(deduped_path, "X_train.csv"), index=False) 
X_test.to_csv(os.path.join(deduped_path, "X_test.csv"), index=False)
y_train.to_csv(os.path.join(deduped_path, "y_train.csv"), index=False)
y_test.to_csv(os.path.join(deduped_path, "y_test.csv"), index=False)
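
To illustrate how the `index` argument affects a round trip through CSV, here is a small sketch using in-memory buffers and a made-up two-row frame. With `index=False`, the row labels are lost on reload; with `index=True` and `index_col=0` on read, they are kept.

```python
import io

import pandas as pd

subset = pd.DataFrame({"tweet": ["x", "y"]}, index=[114966, 808955])

# index=False: the row labels are not written, so reading back gives 0..n-1
buf = io.StringIO()
subset.to_csv(buf, index=False)
buf.seek(0)
no_index = pd.read_csv(buf)

# index=True plus index_col=0 on read keeps the original labels
buf = io.StringIO()
subset.to_csv(buf, index=True)
buf.seek(0)
with_index = pd.read_csv(buf, index_col=0)
```

Either approach works alongside the saved `.npy` indices; the choice mainly affects whether the CSV files are self-describing.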

---