What is Alpha?

1. Alpha factors are designed to extract signals from data to predict asset returns for a given investment universe over the trading horizon. A factor takes on a single value for each asset when evaluated, but may combine one or several input variables.

2. Alpha factors emit entry and exit signals that lead to buy or sell orders, and order execution results in portfolio holdings. The risk profiles of individual positions interact to create a specific portfolio risk profile. Portfolio management involves optimizing position weights to achieve a risk and return profile that aligns with the overall investment objectives. This process is highly dynamic because it must incorporate continuously evolving market data.

Signals derived from alpha factors are often individually weak, but they can become sufficiently powerful when combined with other factors or data sources, for example to modulate the signal as a function of the market or economic context.
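
As a rough illustration of combining individually weak signals, the sketch below standardizes each factor across the assets that share a date and averages the z-scores into one composite value per asset. The column names `date`, `SF1` and `SF2` mirror the factor file loaded later; the helper `combine_factors`, the toy numbers and the equal-weight average are assumptions made for illustration, not the method used in this notebook.

```python
import pandas as pd

def combine_factors(df, factor_cols, date_col="date"):
    """Equal-weight composite of cross-sectional z-scores (hypothetical helper)."""
    grouped = df.groupby(date_col)[factor_cols]
    # z-score each factor across the assets that share a date
    z = (df[factor_cols] - grouped.transform("mean")) / grouped.transform("std")
    # one composite value per (date, asset) row
    return z.mean(axis=1)

toy = pd.DataFrame({
    "date": ["2018-07-01"] * 3,
    "ticker": ["AAA", "BBB", "CCC"],  # made-up tickers
    "SF1": [1.0, 2.0, 3.0],
    "SF2": [2.0, 4.0, 9.0],
})
toy["composite"] = combine_factors(toy, ["SF1", "SF2"])
```

Each factor still yields a single value per asset; the composite is simply another factor built from several inputs.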

# Preprocessing Train Stock Factors and Tweets

## Importing Data and Libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import warnings
warnings.filterwarnings("ignore")

In [2]:
smf=pd.read_csv('train_factors.csv')
json=pd.read_json('train_data.json')

In [3]:
# x = []
# y = []
# z = []
# a = []
# for i in json.records:
#     x.append(i['stocktwit_tweet'])
#     y.append(i['sentiment_score'])
#     z.append(pd.to_datetime(i['timestamp']))
#     a.append(i['ticker'])

# json_data = pd.DataFrame({'Tweet':x, 
#                          'Sentiment_score':y,
#                          'Date':z,
#                          'Stock':a})

In [4]:
# json_data.to_csv('json_df_train.csv',index=False)
json_df=pd.read_csv('json_df_train.csv')
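
The commented-out loop in cell 3 flattens the nested records into one row per tweet before caching the result as CSV. A compact sketch of the same idea follows, using the field names from that loop; the helper name `records_to_frame`, the sample record and its timestamp format are assumptions.

```python
import pandas as pd

def records_to_frame(records):
    """Flatten a list of tweet dicts into the columns used by this notebook."""
    return pd.DataFrame({
        "Tweet": [r["stocktwit_tweet"] for r in records],
        "Sentiment_score": [r["sentiment_score"] for r in records],
        "Date": [pd.to_datetime(r["timestamp"]) for r in records],
        "Stock": [r["ticker"] for r in records],
    })

# Made-up record in the shape the commented-out loop expects
sample = [{"stocktwit_tweet": "$AMD going up", "sentiment_score": 3,
           "timestamp": "2018-09-19 18:38:28", "ticker": "$AMD"}]
json_data = records_to_frame(sample)
```

With the real file, the call would presumably be `records_to_frame(json.records)`, mirroring the loop in cell 3.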

In [5]:
smf.sort_values(by='date').head()

Unnamed: 0,Id,date,ticker,SF1,SF2,SF3,SF4,SF5,SF6,SF7,alpha
13579,13580,01/07/18,$CHK,0.923545,-0.24341,-0.158433,-0.287238,-0.28962,-0.778008,1.019364,3
12721,12722,01/07/18,$EA,-0.633291,-0.490767,-0.100732,0.406717,0.618815,-1.398402,-0.767455,2
5253,5254,01/07/18,$MPC,0.990107,-2.22285,-0.296034,-1.159469,-0.117747,-0.411154,1.636414,1
18031,18032,01/07/18,$RRD,-0.072173,1.129222,-0.595059,-0.920225,1.145811,0.670895,0.6494,2
13028,13029,01/07/18,$WYNN,0.77456,0.942843,-0.500773,-1.594142,0.388954,-0.264942,1.764325,3


## Data Exploring and Cleaning

In [6]:
smf['date']=pd.to_datetime(smf['date'].astype(str),format="%d/%m/%y")

In [7]:
smf.sort_values(by='date').head()

Unnamed: 0,Id,date,ticker,SF1,SF2,SF3,SF4,SF5,SF6,SF7,alpha
25420,25421,2018-07-01,$FBHS,-0.127743,-0.662712,0.524711,-0.356265,-0.962408,-0.349525,-0.041011,1
19929,19930,2018-07-01,$CBS,0.52031,0.612894,-0.047482,-0.062071,-0.238253,-1.047546,0.503267,2
10725,10726,2018-07-01,$NKE,0.854953,-0.598153,0.356626,-0.360109,-1.252869,-0.376943,0.863024,4
25200,25201,2018-07-01,$HOLX,-0.235685,-0.284623,0.051159,0.700647,0.091861,0.992291,-0.63823,2
12721,12722,2018-07-01,$EA,-0.633291,-0.490767,-0.100732,0.406717,0.618815,-1.398402,-0.767455,2


In [8]:
# Keep only the date part of the "date time" string; the time part is not used later
json_df['Date_T']=pd.to_datetime(json_df['Date'].str.split(expand=True)[0])
json_df.drop('Date',inplace=True,axis=1)
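
The cell above keeps the text before the first space and parses it as a date. An equivalent sketch parses the full timestamp and then resets the time to midnight with `dt.normalize()`. The sample strings are made up, and the sketch assumes the `Date` values carry no time zone.

```python
import pandas as pd

dates = pd.Series(["2018-09-19 18:38:28", "2018-10-09 03:51:06"])  # made-up timestamps
date_only = pd.to_datetime(dates).dt.normalize()  # keep the date, zero the time
```

Parsing once avoids the intermediate string columns, at the cost of parsing the time part that is then discarded.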

In [9]:
json_df.sort_values(by='Date_T').head()

Unnamed: 0,Tweet,Sentiment_score,Stock,Date_T
72094,$CLF 9 Monday,4,$CLF,2018-07-01
588364,$AAPL absolutely true. And throwing more bodie...,2,$AAPL,2018-07-01
109383,insider ownership of #Davita is 2.16% $DVA #DV...,2,$DVA,2018-07-01
109430,$SBUX the chart look so ugly without a clear s...,1,$SBUX,2018-07-01
894168,$AMD this board needs to be more active on wee...,2,$AMD,2018-07-01


### Removing Duplicated Rows

In [10]:
json_df.duplicated(keep=False).sum()

43312

In [11]:
json_df.duplicated().sum()

29173
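
The two counts differ because `keep=False` flags every member of a duplicated group (43312 rows), while the default `keep='first'` flags only the repeats after the first occurrence (29173 rows). The second number is therefore how many rows `drop_duplicates(keep='first')` removes. A toy frame, made up for illustration, shows the relationship:

```python
import pandas as pd

toy = pd.DataFrame({"Tweet": ["a", "a", "a", "b", "c", "c"]})

all_copies = toy.duplicated(keep=False).sum()  # every row that has a twin: 3 + 2 = 5
extra_copies = toy.duplicated().sum()          # copies beyond the first: 2 + 1 = 3
deduped = toy.drop_duplicates(keep="first")    # one row each for a, b, c
```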

In [12]:
json_df.drop_duplicates(keep='first',inplace=True)

In [13]:
json_df=json_df.reset_index(drop=True)
json_df.head()

Unnamed: 0,Tweet,Sentiment_score,Stock,Date_T
0,$AMD going up but hesitating however chart is ...,3,$AMD,2018-09-19
1,@inforlong @MariaGascon Despite\nChina trade w...,3,$CAT,2018-10-09
2,$AVGO WTF?,2,$AVGO,2018-07-12
3,$PH\n New Insider Filing On: \n MULLER KLAUS P...,2,$PH,2018-07-19
4,$FB if it bounces tommorrow do the right thing...,3,$FB,2018-08-23


### Cleaning Stock Ticker SMF

In [14]:
smf['ticker']=smf['ticker'].str.replace('$','',regex=False)
smf['ticker']=smf['ticker'].str.upper()

### Cleaning Stock Ticker JSON

In [15]:
json_df['Stock']=json_df['Stock'].str.replace('$','',regex=False)
json_df['Stock']=json_df['Stock'].str.upper()
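
How a bare `'$'` is interpreted by `str.replace` has varied across pandas releases, so it can help to request a literal match explicitly. The sketch below contrasts the literal replacement with what happens when `$` is read as a regular expression, where it is an end-of-string anchor. The sample tickers are made up.

```python
import re
import pandas as pd

tickers = pd.Series(["$amd", "$CAT"])  # made-up sample
cleaned = tickers.str.replace("$", "", regex=False).str.upper()

# As a regex, "$" only matches the empty position at the end of the string,
# so the dollar sign itself is left in place
as_regex = re.sub("$", "", "$amd")
```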

In [16]:
smf.head()

Unnamed: 0,Id,date,ticker,SF1,SF2,SF3,SF4,SF5,SF6,SF7,alpha
0,1,2018-08-21,NTAP,-0.628652,0.988891,-0.055714,0.774379,0.551089,-1.329229,-0.995539,2
1,2,2018-10-11,WYNN,1.315786,1.438754,0.187327,0.608933,-1.15303,1.859441,0.730995,3
2,3,2018-08-21,DRI,-1.141388,-1.455016,0.332755,0.674502,0.111326,-0.478597,-1.488157,1
3,4,2018-07-10,GE,-0.054839,-1.454149,-0.162267,-0.68187,0.307869,-0.529987,0.404172,2
4,5,2018-09-12,FE,-0.686366,0.838865,0.07383,0.679024,0.329463,1.262782,-1.024042,2


In [17]:
json_df.head()

Unnamed: 0,Tweet,Sentiment_score,Stock,Date_T
0,$AMD going up but hesitating however chart is ...,3,AMD,2018-09-19
1,@inforlong @MariaGascon Despite\nChina trade w...,3,CAT,2018-10-09
2,$AVGO WTF?,2,AVGO,2018-07-12
3,$PH\n New Insider Filing On: \n MULLER KLAUS P...,2,PH,2018-07-19
4,$FB if it bounces tommorrow do the right thing...,3,FB,2018-08-23


## Feature Engineering TP

### Feature Engineering For SMF

#### Adding Month

In [18]:
smf['Month']=smf['date'].dt.month

#### Adding Day

In [19]:
smf['Day']=smf['date'].dt.day

#### Adding WeekDay

In [20]:
smf['WeekDay']=smf['date'].dt.weekday

#### Adding WeekNumber

In [21]:
# ISO week number of each date
smf['WeekNumber']=[d.isocalendar()[1] for d in smf['date']]

### Feature Engineering For JSON

#### Adding Month

In [22]:
json_df['Month']=json_df['Date_T'].dt.month

#### Adding Day

In [23]:
json_df['Day']=json_df['Date_T'].dt.day

#### Adding WeekDay

In [24]:
json_df['WeekDay']=json_df['Date_T'].dt.weekday

#### Adding WeekNumber

In [25]:
# ISO week number of each date
json_df['WeekNumber']=[d.isocalendar()[1] for d in json_df['Date_T']]

In [26]:
smf.head()

Unnamed: 0,Id,date,ticker,SF1,SF2,SF3,SF4,SF5,SF6,SF7,alpha,Month,Day,WeekDay,WeekNumber
0,1,2018-08-21,NTAP,-0.628652,0.988891,-0.055714,0.774379,0.551089,-1.329229,-0.995539,2,8,21,1,34
1,2,2018-10-11,WYNN,1.315786,1.438754,0.187327,0.608933,-1.15303,1.859441,0.730995,3,10,11,3,41
2,3,2018-08-21,DRI,-1.141388,-1.455016,0.332755,0.674502,0.111326,-0.478597,-1.488157,1,8,21,1,34
3,4,2018-07-10,GE,-0.054839,-1.454149,-0.162267,-0.68187,0.307869,-0.529987,0.404172,2,7,10,1,28
4,5,2018-09-12,FE,-0.686366,0.838865,0.07383,0.679024,0.329463,1.262782,-1.024042,2,9,12,2,37
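
The `WeekDay` and `WeekNumber` columns above can also be derived without iterating over rows. The sketch below assumes a pandas release that provides `Series.dt.isocalendar()` and uses two dates taken from the output above.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2018-08-21", "2018-10-11"]))
week_number = dates.dt.isocalendar().week.astype(int)  # ISO week of the year
week_day = dates.dt.weekday                            # Monday=0 ... Sunday=6
```

For these two dates the results agree with the table: weeks 34 and 41, weekdays 1 (Tuesday) and 3 (Thursday).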


In [27]:
json_df.head()

Unnamed: 0,Tweet,Sentiment_score,Stock,Date_T,Month,Day,WeekDay,WeekNumber
0,$AMD going up but hesitating however chart is ...,3,AMD,2018-09-19,9,19,2,38
1,@inforlong @MariaGascon Despite\nChina trade w...,3,CAT,2018-10-09,10,9,1,41
2,$AVGO WTF?,2,AVGO,2018-07-12,7,12,3,28
3,$PH\n New Insider Filing On: \n MULLER KLAUS P...,2,PH,2018-07-19,7,19,3,29
4,$FB if it bounces tommorrow do the right thing...,3,FB,2018-08-23,8,23,3,34


## Saving Cleaned SMF Data to CSV

In [28]:
smf.to_csv('smf_cleaned_train_val.csv',index=False)

## Cleaning JSON(Textual Data)

In [29]:
import spacy
import nltk
from nltk.corpus import wordnet as wn
from nltk.tokenize.toktok import ToktokTokenizer
import re
from bs4 import BeautifulSoup
import unicodedata

In [30]:
CONTRACTION_MAP = {
"ain't": "is not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he'll've": "he will have",
"he's": "he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how is",
"I'd": "I would",
"I'd've": "I would have",
"I'll": "I will",
"I'll've": "I will have",
"I'm": "I am",
"I've": "I have",
"i'd": "i would",
"i'd've": "i would have",
"i'll": "i will",
"i'll've": "i will have",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'd've": "it would have",
"it'll": "it will",
"it'll've": "it will have",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she would",
"she'd've": "she would have",
"she'll": "she will",
"she'll've": "she will have",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so as",
"that'd": "that would",
"that'd've": "that would have",
"that's": "that is",
"there'd": "there would",
"there'd've": "there would have",
"there's": "there is",
"they'd": "they would",
"they'd've": "they would have",
"they'll": "they will",
"they'll've": "they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what'll've": "what will have",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"when's": "when is",
"when've": "when have",
"where'd": "where did",
"where's": "where is",
"where've": "where have",
"who'll": "who will",
"who'll've": "who will have",
"who's": "who is",
"who've": "who have",
"why's": "why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you would",
"you'd've": "you would have",
"you'll": "you will",
"you'll've": "you will have",
"you're": "you are",
"you've": "you have"
}

In [31]:
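# The 'en' shortcut and the parse/tag/entity keywords belong to older spaCy releases;
# this form assumes the en_core_web_sm model is installed
nlp = spacy.load('en_core_web_sm', disable=['parser', 'tagger', 'ner'])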
tokenizer = ToktokTokenizer()
stopword_list = nltk.corpus.stopwords.words('english')
stopword_list.remove('no')
stopword_list.remove('not')

## Removing and Parsing Html Tags

In [32]:
def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    stripped_text = soup.get_text()
    return stripped_text

In [33]:
#json_df['Tidy_Tweet'].to_csv('tidy_tweet_html.csv',index=False)
json_df['Tidy_Tweet']=json_df['Tweet'].apply(strip_html_tags)
json_df.head()

Unnamed: 0,Tweet,Sentiment_score,Stock,Date_T,Month,Day,WeekDay,WeekNumber,Tidy_Tweet
0,$AMD going up but hesitating however chart is ...,3,AMD,2018-09-19,9,19,2,38,$AMD going up but hesitating however chart is ...
1,@inforlong @MariaGascon Despite\nChina trade w...,3,CAT,2018-10-09,10,9,1,41,@inforlong @MariaGascon Despite\nChina trade w...
2,$AVGO WTF?,2,AVGO,2018-07-12,7,12,3,28,$AVGO WTF?
3,$PH\n New Insider Filing On: \n MULLER KLAUS P...,2,PH,2018-07-19,7,19,3,29,$PH\n New Insider Filing On: \n MULLER KLAUS P...
4,$FB if it bounces tommorrow do the right thing...,3,FB,2018-08-23,8,23,3,34,$FB if it bounces tommorrow do the right thing...


## Feature Engineering(Continued)

### Feature Engineering adding Punctuation Percentage

In [34]:
import string
def count_punct(text):
    non_space = len(text) - text.count(" ")
    if non_space == 0:  # empty or whitespace-only tweet
        return 0.0
    count = sum(1 for char in text if char in string.punctuation)
    return round(count/non_space, 3)*100

In [35]:
json_df['Punctuation_%'] = json_df['Tidy_Tweet'].apply(lambda x: count_punct(x))

### Feature Engineering adding Text Length

In [36]:
json_df['Text Length'] = json_df['Tidy_Tweet'].apply(lambda x: len(x) - x.count(" "))
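
Both features ignore spaces: the text length counts non-space characters, and the punctuation percentage is the share of those characters found in `string.punctuation`. A worked example on the tweet `$AVGO WTF?` from the data follows; the helper name `punct_percent` is hypothetical and it rounds after scaling to percent.

```python
import string

def punct_percent(text):
    """Share of non-space characters that are punctuation, in percent."""
    non_space = len(text) - text.count(" ")
    punct = sum(1 for ch in text if ch in string.punctuation)
    return round(100 * punct / non_space, 1)

tweet = "$AVGO WTF?"
text_length = len(tweet) - tweet.count(" ")  # 9 non-space characters
pct = punct_percent(tweet)                   # "$" and "?" are 2 of the 9
```

Because the features are computed before the cleaning steps below, the `$` of the cashtag counts as punctuation.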

In [37]:
json_df.head()

Unnamed: 0,Tweet,Sentiment_score,Stock,Date_T,Month,Day,WeekDay,WeekNumber,Tidy_Tweet,Punctuation_%,Text Length
0,$AMD going up but hesitating however chart is ...,3,AMD,2018-09-19,9,19,2,38,$AMD going up but hesitating however chart is ...,1.6,62
1,@inforlong @MariaGascon Despite\nChina trade w...,3,CAT,2018-10-09,10,9,1,41,@inforlong @MariaGascon Despite\nChina trade w...,5.0,60
2,$AVGO WTF?,2,AVGO,2018-07-12,7,12,3,28,$AVGO WTF?,22.2,9
3,$PH\n New Insider Filing On: \n MULLER KLAUS P...,2,PH,2018-07-19,7,19,3,29,$PH\n New Insider Filing On: \n MULLER KLAUS P...,11.0,100
4,$FB if it bounces tommorrow do the right thing...,3,FB,2018-08-23,8,23,3,34,$FB if it bounces tommorrow do the right thing...,2.2,45


## Cleaning Json Tweet (Continued)

### Removing Http and www from Tweet

In [38]:
# json_df['Tidy_Tweet'] = np.vectorize(remove_pattern)(json_df['Tidy_Tweet'], "http[\w]*") 
#json_df['Tidy_Tweet'].to_csv('tidy_tweet_htmlhttp.csv',index=False)
json_df['Tidy_Tweet'] = json_df['Tidy_Tweet'].str.replace(r'http\S*|www\.\S*','', case=False, regex=True)
json_df.head()

Unnamed: 0,Tweet,Sentiment_score,Stock,Date_T,Month,Day,WeekDay,WeekNumber,Tidy_Tweet,Punctuation_%,Text Length
0,$AMD going up but hesitating however chart is ...,3,AMD,2018-09-19,9,19,2,38,$AMD going up but hesitating however chart is ...,1.6,62
1,@inforlong @MariaGascon Despite\nChina trade w...,3,CAT,2018-10-09,10,9,1,41,@inforlong @MariaGascon Despite\nChina trade w...,5.0,60
2,$AVGO WTF?,2,AVGO,2018-07-12,7,12,3,28,$AVGO WTF?,22.2,9
3,$PH\n New Insider Filing On: \n MULLER KLAUS P...,2,PH,2018-07-19,7,19,3,29,$PH\n New Insider Filing On: \n MULLER KLAUS P...,11.0,100
4,$FB if it bounces tommorrow do the right thing...,3,FB,2018-08-23,8,23,3,34,$FB if it bounces tommorrow do the right thing...,2.2,45
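
The URL step deletes every run of non-space characters that begins with `http` or `www.`. A standalone sketch with the standard `re` module, applied to a tweet from the test set, is shown below; the helper name `strip_urls` is hypothetical.

```python
import re

URL_PATTERN = re.compile(r"http\S*|www\.\S*", flags=re.IGNORECASE)

def strip_urls(text):
    """Delete each run that starts at http or www. and ends at the next whitespace."""
    return URL_PATTERN.sub("", text)

example = strip_urls("$CBS https://tenor.com/wLB8.gif")
```

The pattern is not anchored to the start of a token, so a word that merely contains `http` would be truncated as well.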


### Collapsing Repeated Characters in Tweet

In [39]:
json_df['Tidy_Tweet'] = json_df['Tidy_Tweet'].apply(lambda x : re.sub(r'(.)\1{1,}', r'\1\1', x) )
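
The pattern `(.)\1{1,}` matches any character followed by one or more copies of itself, and the replacement keeps exactly two. Elongated spellings are shortened while ordinary double letters are left alone. A small sketch follows; the helper name `squeeze_repeats` is hypothetical.

```python
import re

def squeeze_repeats(text):
    """Shorten any run of the same character to two."""
    return re.sub(r"(.)\1{1,}", r"\1\1", text)

squeezed = squeeze_repeats("sooooo goood!!!")
unchanged = squeeze_repeats("tommorrow")  # double letters are already at most two
```

The rule works on characters, not on whole words, so it also shortens runs of punctuation and spaces.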

### Converting Emojis to Words

In [40]:
import emoji

In [41]:
#json_df['Tidy_Tweet'].to_csv('tidy_tweet_htmlhttpemo.csv',index=False)
json_df['Tidy_Tweet']=json_df['Tidy_Tweet'].apply(lambda x: emoji.demojize(x))

In [42]:
json_df['Tidy_Tweet']=json_df['Tidy_Tweet'].str.replace(":", "",regex=False)
json_df['Tidy_Tweet']=json_df['Tidy_Tweet'].str.replace("_", " ",regex=False)
json_df.head()

Unnamed: 0,Tweet,Sentiment_score,Stock,Date_T,Month,Day,WeekDay,WeekNumber,Tidy_Tweet,Punctuation_%,Text Length
0,$AMD going up but hesitating however chart is ...,3,AMD,2018-09-19,9,19,2,38,$AMD going up but hesitating however chart is ...,1.6,62
1,@inforlong @MariaGascon Despite\nChina trade w...,3,CAT,2018-10-09,10,9,1,41,@inforlong @MariaGascon Despite\nChina trade w...,5.0,60
2,$AVGO WTF?,2,AVGO,2018-07-12,7,12,3,28,$AVGO WTF?,22.2,9
3,$PH\n New Insider Filing On: \n MULLER KLAUS P...,2,PH,2018-07-19,7,19,3,29,$PH\n New Insider Filing On \n MULLER KLAUS PE...,11.0,100
4,$FB if it bounces tommorrow do the right thing...,3,FB,2018-08-23,8,23,3,34,$FB if it bounces tommorrow do the right thing...,2.2,45


### Removing @ Twitter Handles from Tweet

In [43]:
#json_df['Tidy_Tweet'].to_csv('tidy_tweet_htmlhttpemo@.csv',index=False)
json_df['Tidy_Tweet'] =json_df['Tidy_Tweet'].str.replace(r"@\w*","",regex=True)
json_df.head()

Unnamed: 0,Tweet,Sentiment_score,Stock,Date_T,Month,Day,WeekDay,WeekNumber,Tidy_Tweet,Punctuation_%,Text Length
0,$AMD going up but hesitating however chart is ...,3,AMD,2018-09-19,9,19,2,38,$AMD going up but hesitating however chart is ...,1.6,62
1,@inforlong @MariaGascon Despite\nChina trade w...,3,CAT,2018-10-09,10,9,1,41,Despite\nChina trade war $CAT held very well...,5.0,60
2,$AVGO WTF?,2,AVGO,2018-07-12,7,12,3,28,$AVGO WTF?,22.2,9
3,$PH\n New Insider Filing On: \n MULLER KLAUS P...,2,PH,2018-07-19,7,19,3,29,$PH\n New Insider Filing On \n MULLER KLAUS PE...,11.0,100
4,$FB if it bounces tommorrow do the right thing...,3,FB,2018-08-23,8,23,3,34,$FB if it bounces tommorrow do the right thing...,2.2,45


### Removing dollarsign from Tweet

In [44]:
#json_df['Tidy_Tweet'].to_csv('tidy_tweet_htmlhttpemo@$.csv',index=False)
json_df['Tidy_Tweet']=json_df['Tidy_Tweet'].str.replace(r"\$\w*","",regex=True)
json_df.head()

Unnamed: 0,Tweet,Sentiment_score,Stock,Date_T,Month,Day,WeekDay,WeekNumber,Tidy_Tweet,Punctuation_%,Text Length
0,$AMD going up but hesitating however chart is ...,3,AMD,2018-09-19,9,19,2,38,going up but hesitating however chart is very...,1.6,62
1,@inforlong @MariaGascon Despite\nChina trade w...,3,CAT,2018-10-09,10,9,1,41,Despite\nChina trade war held very well thu...,5.0,60
2,$AVGO WTF?,2,AVGO,2018-07-12,7,12,3,28,WTF?,22.2,9
3,$PH\n New Insider Filing On: \n MULLER KLAUS P...,2,PH,2018-07-19,7,19,3,29,\n New Insider Filing On \n MULLER KLAUS PETER...,11.0,100
4,$FB if it bounces tommorrow do the right thing...,3,FB,2018-08-23,8,23,3,34,if it bounces tommorrow do the right thing an...,2.2,45
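
The last two steps remove `@handle` mentions and `$TICKER` cashtags with the patterns `@\w*` and `\$\w*`. A standalone sketch with the `re` module is shown below; the helper name and the shortened sample tweet are assumptions.

```python
import re

def strip_handles_and_cashtags(text):
    """Remove @handles first, then $TICKER cashtags."""
    text = re.sub(r"@\w*", "", text)
    return re.sub(r"\$\w*", "", text)

example = strip_handles_and_cashtags("@inforlong $CAT held very well")
```

Since `\w` also matches digits, the cashtag pattern removes dollar amounts such as `$5` too, and the spaces that surrounded the removed tokens stay in place until the later cleaning steps.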


### Removing Accented Chars from Tweet

In [45]:
def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text
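
NFKD normalization splits an accented letter into its base letter plus a combining mark, and the ASCII encode with `'ignore'` then drops the mark. Characters with no ASCII decomposition are dropped entirely, which includes the curly apostrophe seen in tweets such as `Don’t`. The sketch below illustrates both cases; the helper name `to_ascii` is hypothetical.

```python
import unicodedata

def to_ascii(text):
    """NFKD-decompose, then drop whatever cannot be encoded as ASCII."""
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")

plain = to_ascii("café")        # accent stripped, base letter kept
curly = to_ascii("Don\u2019t")  # U+2019 has no ASCII form and is dropped
```

One consequence is that contractions written with a curly apostrophe reach the contraction step without any apostrophe, so they are not expanded. Replacing U+2019 with `'` beforehand would be one possible way to handle them.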

In [46]:
#json_df['Tidy_Tweet'].to_csv('tidy_tweet_htmlhttpemo@$accented.csv',index=False)
json_df['Tidy_Tweet']=json_df['Tidy_Tweet'].apply(remove_accented_chars)
json_df.head()

Unnamed: 0,Tweet,Sentiment_score,Stock,Date_T,Month,Day,WeekDay,WeekNumber,Tidy_Tweet,Punctuation_%,Text Length
0,$AMD going up but hesitating however chart is ...,3,AMD,2018-09-19,9,19,2,38,going up but hesitating however chart is very...,1.6,62
1,@inforlong @MariaGascon Despite\nChina trade w...,3,CAT,2018-10-09,10,9,1,41,Despite\nChina trade war held very well thu...,5.0,60
2,$AVGO WTF?,2,AVGO,2018-07-12,7,12,3,28,WTF?,22.2,9
3,$PH\n New Insider Filing On: \n MULLER KLAUS P...,2,PH,2018-07-19,7,19,3,29,\n New Insider Filing On \n MULLER KLAUS PETER...,11.0,100
4,$FB if it bounces tommorrow do the right thing...,3,FB,2018-08-23,8,23,3,34,if it bounces tommorrow do the right thing an...,2.2,45


### Lower Text

In [47]:
#json_df['Tidy_Tweet'].to_csv('tidy_tweet_htmlhttpemo@$accentedlower.csv',index=False)
json_df['Tidy_Tweet']=json_df['Tidy_Tweet'].str.lower()
json_df.head()

Unnamed: 0,Tweet,Sentiment_score,Stock,Date_T,Month,Day,WeekDay,WeekNumber,Tidy_Tweet,Punctuation_%,Text Length
0,$AMD going up but hesitating however chart is ...,3,AMD,2018-09-19,9,19,2,38,going up but hesitating however chart is very...,1.6,62
1,@inforlong @MariaGascon Despite\nChina trade w...,3,CAT,2018-10-09,10,9,1,41,despite\nchina trade war held very well thu...,5.0,60
2,$AVGO WTF?,2,AVGO,2018-07-12,7,12,3,28,wtf?,22.2,9
3,$PH\n New Insider Filing On: \n MULLER KLAUS P...,2,PH,2018-07-19,7,19,3,29,\n new insider filing on \n muller klaus peter...,11.0,100
4,$FB if it bounces tommorrow do the right thing...,3,FB,2018-08-23,8,23,3,34,if it bounces tommorrow do the right thing an...,2.2,45


### Expanding Contractions

In [48]:
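def expand_contractions(text, contraction_mapping=CONTRACTION_MAP):
    # Try longer keys first so that "can't've" is not shadowed by its prefix "can't"
    keys = sorted(contraction_mapping.keys(), key=len, reverse=True)
    contractions_pattern = re.compile('({})'.format('|'.join(re.escape(k) for k in keys)),
                                      flags=re.IGNORECASE|re.DOTALL)
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        # Exact-case lookup first, then fall back to the lowercase key
        expanded_contraction = contraction_mapping.get(match) or contraction_mapping.get(match.lower())
        # Carry over only the capitalization; splicing in the matched first
        # character would turn "ain't" into "as not" and "'cause" into "'ecause"
        if first_char.isupper():
            expanded_contraction = expanded_contraction[0].upper()+expanded_contraction[1:]
        return expanded_contraction

    expanded_text = contractions_pattern.sub(expand_match, text)
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text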
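
With a regex alternation the first alternative that matches wins, so the order of the keys matters: if `can't` is tried before `can't've`, the longer contraction is never matched as a whole. The sketch below sorts the keys longest first on a three-entry subset of the map; the names `SMALL_MAP` and `expand` are hypothetical.

```python
import re

# Subset of the contraction map, for illustration only
SMALL_MAP = {"can't": "cannot", "can't've": "cannot have", "it's": "it is"}

# Longest keys first, so "can't've" is tried before its prefix "can't"
_keys = sorted(SMALL_MAP, key=len, reverse=True)
_pattern = re.compile("|".join(re.escape(k) for k in _keys), flags=re.IGNORECASE)

def expand(text):
    """Replace each matched contraction by its expansion."""
    return _pattern.sub(lambda m: SMALL_MAP[m.group(0).lower()], text)

expanded = expand("it's late, i can't've known")
```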

In [49]:
json_df['Tidy_Tweet']=json_df['Tidy_Tweet'].apply(expand_contractions)
json_df.head()

Unnamed: 0,Tweet,Sentiment_score,Stock,Date_T,Month,Day,WeekDay,WeekNumber,Tidy_Tweet,Punctuation_%,Text Length
0,$AMD going up but hesitating however chart is ...,3,AMD,2018-09-19,9,19,2,38,going up but hesitating however chart is very...,1.6,62
1,@inforlong @MariaGascon Despite\nChina trade w...,3,CAT,2018-10-09,10,9,1,41,despite\nchina trade war held very well thu...,5.0,60
2,$AVGO WTF?,2,AVGO,2018-07-12,7,12,3,28,wtf?,22.2,9
3,$PH\n New Insider Filing On: \n MULLER KLAUS P...,2,PH,2018-07-19,7,19,3,29,\n new insider filing on \n muller klaus peter...,11.0,100
4,$FB if it bounces tommorrow do the right thing...,3,FB,2018-08-23,8,23,3,34,if it bounces tommorrow do the right thing an...,2.2,45


### Removing Punctuation, Numbers, Special Characters and Unnecessary Hashtags

In [50]:
#json_df['Tidy_Tweet'].to_csv('tidy_tweet_htmlhttpemo@$accentedlowercontractspecial.csv',index=False)
json_df['Tidy_Tweet'] = json_df['Tidy_Tweet'].str.replace(r"[^a-zA-Z#]", " ", regex=True)

In [51]:
json_df['Tidy_Tweet'] = json_df['Tidy_Tweet'].str.replace(r"[#+]?\B", "", regex=True)
json_df.head()

Unnamed: 0,Tweet,Sentiment_score,Stock,Date_T,Month,Day,WeekDay,WeekNumber,Tidy_Tweet,Punctuation_%,Text Length
0,$AMD going up but hesitating however chart is ...,3,AMD,2018-09-19,9,19,2,38,going up but hesitating however chart is very...,1.6,62
1,@inforlong @MariaGascon Despite\nChina trade w...,3,CAT,2018-10-09,10,9,1,41,despite china trade war held very well thum...,5.0,60
2,$AVGO WTF?,2,AVGO,2018-07-12,7,12,3,28,wtf,22.2,9
3,$PH\n New Insider Filing On: \n MULLER KLAUS P...,2,PH,2018-07-19,7,19,3,29,new insider filing on muller klaus peter t...,11.0,100
4,$FB if it bounces tommorrow do the right thing...,3,FB,2018-08-23,8,23,3,34,if it bounces tommorrow do the right thing an...,2.2,45


### Removing Stopwords

In [52]:
def remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    filtered_tokens = [token for token in tokens if token not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text
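
Cell 31 removes `no` and `not` from the stopword list so that negations survive this step, which matters for sentiment. The sketch below shows the effect with a tiny made-up stopword set instead of the NLTK list; the names `STOPWORDS`, `KEEP` and `drop_stopwords` are hypothetical.

```python
STOPWORDS = {"the", "is", "a", "no", "not", "do"}  # tiny made-up list, not NLTK's
KEEP = {"no", "not"}                               # negations carry sentiment

def drop_stopwords(text, stopwords=STOPWORDS - KEEP):
    """Drop tokens found in the stopword set, keeping the negations."""
    return " ".join(tok for tok in text.split() if tok not in stopwords)

result = drop_stopwords("the chart is not a buy")
```

Using a set keeps each membership test fast, which may be noticeable on a corpus of this size.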

In [53]:
#json_df['Tidy_Tweet'].to_csv('tidy_tweet_htmlhttpemo@$accentedlowercontractspecialstp.csv',index=False)
json_df['Tidy_Tweet']=json_df['Tidy_Tweet'].apply(remove_stopwords)
json_df.head()

Unnamed: 0,Tweet,Sentiment_score,Stock,Date_T,Month,Day,WeekDay,WeekNumber,Tidy_Tweet,Punctuation_%,Text Length
0,$AMD going up but hesitating however chart is ...,3,AMD,2018-09-19,9,19,2,38,going hesitating however chart stable going up...,1.6,62
1,@inforlong @MariaGascon Despite\nChina trade w...,3,CAT,2018-10-09,10,9,1,41,despite china trade war held well thumbs,5.0,60
2,$AVGO WTF?,2,AVGO,2018-07-12,7,12,3,28,wtf,22.2,9
3,$PH\n New Insider Filing On: \n MULLER KLAUS P...,2,PH,2018-07-19,7,19,3,29,new insider filing muller klaus peter transact...,11.0,100
4,$FB if it bounces tommorrow do the right thing...,3,FB,2018-08-23,8,23,3,34,bounces tommorrow right thing gtfo,2.2,45


### Stemming Text

In [54]:
# from nltk import WordNetLemmatizer
# lemmatizer = WordNetLemmatizer()
# tokenized_tweet = tokenized_tweet.apply(lambda x: [lemmatizer.lemmatize(i) for i in x])
# json_df['Tidy_Tweet'].to_csv('tidy_tweet_htmlhttpemo@$accentedlowercontractspecialstplemma.csv',index=False)
# json_df.iloc[121411:121414]
tokenized_tweet = json_df['Tidy_Tweet'].apply(lambda x: x.split()) 
tokenized_tweet.head()

0    [going, hesitating, however, chart, stable, go...
1     [despite, china, trade, war, held, well, thumbs]
2                                                [wtf]
3    [new, insider, filing, muller, klaus, peter, t...
4             [bounces, tommorrow, right, thing, gtfo]
Name: Tidy_Tweet, dtype: object

In [55]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
tokenized_tweet = tokenized_tweet.apply(lambda x: [stemmer.stem(i) for i in x])
tokenized_tweet.head()

0         [go, hesit, howev, chart, stabl, go, upward]
1       [despit, china, trade, war, held, well, thumb]
2                                                [wtf]
3    [new, insid, file, muller, klau, peter, transa...
4               [bounc, tommorrow, right, thing, gtfo]
Name: Tidy_Tweet, dtype: object

In [56]:
# Join the stemmed tokens back into one string per tweet
json_df['Tidy_Tweet'] = tokenized_tweet.apply(' '.join)
json_df.head()

Unnamed: 0,Tweet,Sentiment_score,Stock,Date_T,Month,Day,WeekDay,WeekNumber,Tidy_Tweet,Punctuation_%,Text Length
0,$AMD going up but hesitating however chart is ...,3,AMD,2018-09-19,9,19,2,38,go hesit howev chart stabl go upward,1.6,62
1,@inforlong @MariaGascon Despite\nChina trade w...,3,CAT,2018-10-09,10,9,1,41,despit china trade war held well thumb,5.0,60
2,$AVGO WTF?,2,AVGO,2018-07-12,7,12,3,28,wtf,22.2,9
3,$PH\n New Insider Filing On: \n MULLER KLAUS P...,2,PH,2018-07-19,7,19,3,29,new insid file muller klau peter transact code,11.0,100
4,$FB if it bounces tommorrow do the right thing...,3,FB,2018-08-23,8,23,3,34,bounc tommorrow right thing gtfo,2.2,45


### Check Null Rows and Removing Blank Tweets

In [57]:
temp_checkpoint = json_df
temp_checkpoint.head()

Unnamed: 0,Tweet,Sentiment_score,Stock,Date_T,Month,Day,WeekDay,WeekNumber,Tidy_Tweet,Punctuation_%,Text Length
0,$AMD going up but hesitating however chart is ...,3,AMD,2018-09-19,9,19,2,38,go hesit howev chart stabl go upward,1.6,62
1,@inforlong @MariaGascon Despite\nChina trade w...,3,CAT,2018-10-09,10,9,1,41,despit china trade war held well thumb,5.0,60
2,$AVGO WTF?,2,AVGO,2018-07-12,7,12,3,28,wtf,22.2,9
3,$PH\n New Insider Filing On: \n MULLER KLAUS P...,2,PH,2018-07-19,7,19,3,29,new insid file muller klau peter transact code,11.0,100
4,$FB if it bounces tommorrow do the right thing...,3,FB,2018-08-23,8,23,3,34,bounc tommorrow right thing gtfo,2.2,45


In [58]:
test_tidy=temp_checkpoint[temp_checkpoint['Tidy_Tweet']!=""].copy()
test_tidy.isnull().sum()

Tweet              0
Sentiment_score    0
Stock              0
Date_T             0
Month              0
Day                0
WeekDay            0
WeekNumber         0
Tidy_Tweet         0
Punctuation_%      0
Text Length        0
dtype: int64

In [59]:
test_tidy[test_tidy['Tidy_Tweet']==""]

Unnamed: 0,Tweet,Sentiment_score,Stock,Date_T,Month,Day,WeekDay,WeekNumber,Tidy_Tweet,Punctuation_%,Text Length


In [60]:
test_tidy.isnull().sum()

Tweet              0
Sentiment_score    0
Stock              0
Date_T             0
Month              0
Day                0
WeekDay            0
WeekNumber         0
Tidy_Tweet         0
Punctuation_%      0
Text Length        0
dtype: int64

In [61]:
json_df=test_tidy

In [62]:
json_df.head()

Unnamed: 0,Tweet,Sentiment_score,Stock,Date_T,Month,Day,WeekDay,WeekNumber,Tidy_Tweet,Punctuation_%,Text Length
0,$AMD going up but hesitating however chart is ...,3,AMD,2018-09-19,9,19,2,38,go hesit howev chart stabl go upward,1.6,62
1,@inforlong @MariaGascon Despite\nChina trade w...,3,CAT,2018-10-09,10,9,1,41,despit china trade war held well thumb,5.0,60
2,$AVGO WTF?,2,AVGO,2018-07-12,7,12,3,28,wtf,22.2,9
3,$PH\n New Insider Filing On: \n MULLER KLAUS P...,2,PH,2018-07-19,7,19,3,29,new insid file muller klau peter transact code,11.0,100
4,$FB if it bounces tommorrow do the right thing...,3,FB,2018-08-23,8,23,3,34,bounc tommorrow right thing gtfo,2.2,45


In [63]:
json_df.shape

(986803, 11)

## Saving Cleaned JSON(Textual Data)

In [64]:
json_df.drop('Tweet',axis=1,inplace=True)

In [65]:
json_df.to_csv('json_df_cleaned_train_val.csv',index=False)

## Fin

# Preprocessing Test Stock Factors and Tweets

## Import Data and Libraries

In [2]:
smf=pd.read_csv('test_factors.csv')
json=pd.read_json('test_data.json')

In [3]:
# x = []
# ##y = []
# z = []
# a = []
# for i in json.records:
#     x.append(i['stocktwit_tweet'])
#     #y.append(i['sentiment_score'])
#     z.append(pd.to_datetime(i['timestamp']))
#     a.append(i['ticker'])

# json_data = pd.DataFrame({'Tweet':x, 
#                          #'Sentiment_score':y,
#                          'Date':z,
#                          'Stock':a})

In [4]:
# json_data.to_csv('json_df_test.csv',index=False)
json_df=pd.read_csv('json_df_test.csv')

In [5]:
smf.sort_values(by='date').head()

Unnamed: 0,Id,date,ticker,SF1,SF2,SF3,SF4,SF5,SF6,SF7
4513,274520,01/07/18,$NVDA,-1.248715,0.600227,-0.051088,2.189014,1.019032,0.29441,-2.382914
7000,277007,01/07/18,$KSS,1.297563,-1.492159,0.37898,0.644587,-1.512155,-0.245498,0.641642
11372,281379,01/07/18,$AMAT,0.131374,-2.249649,0.104021,0.610359,-0.247068,-2.268079,-0.278888
9805,279812,01/07/18,$GOOG,0.831799,0.279735,0.419933,0.053564,-1.335868,-2.135489,0.578237
10055,280062,01/07/18,$TSN,0.022721,-1.116382,0.100513,0.343247,-0.188535,-0.999391,-0.212836


## Data Exploring and Cleaning

In [6]:
smf['date']=pd.to_datetime(smf['date'].astype(str),format="%d/%m/%y")

In [7]:
smf.sort_values(by='date').head()

Unnamed: 0,Id,date,ticker,SF1,SF2,SF3,SF4,SF5,SF6,SF7
9048,279055,2018-07-01,$SNPS,0.802727,0.259343,-0.587202,0.542434,0.671571,0.104758,0.534639
7427,277434,2018-07-01,$WHR,-0.142381,1.655428,-0.005262,-0.695668,0.056611,0.233851,0.293441
5216,275223,2018-07-01,$CBRE,-0.838393,-0.662745,1.318567,-2.103852,-2.168325,0.33899,0.169036
2463,272470,2018-07-01,$CNC,-1.81665,-1.97437,0.184403,0.31956,0.801997,-0.217545,-1.823715
1982,271989,2018-07-01,$URI,-0.512415,0.35674,0.636302,-0.866115,-0.969422,-0.768523,-0.101548


In [8]:
# Keep only the date part of the "date time" string; the time part is not used later
json_df['Date_T']=pd.to_datetime(json_df['Date'].str.split(expand=True)[0])
json_df.drop('Date',inplace=True,axis=1)

In [9]:
json_df.sort_values(by='Date_T').head()

Unnamed: 0,Tweet,Stock,Date_T
177101,$T own 427 shares and buying more on payday th...,$T,2018-07-01
74702,$CHK the 15 min chart from Friday indicating s...,$CHK,2018-07-01
61216,"$AMD the chase continues, hopefully they catch...",$AMD,2018-07-01
13010,Short sale volume (not short interest) for $WP...,$WPX,2018-07-01
191134,Recent $ICE technical alerts: Bollinger Band S...,$ICE,2018-07-01


### Remove Duplicated Rows

In [10]:
json_df.duplicated(keep=False).sum()

6245

In [11]:
json_df.duplicated().sum()

3880

In [12]:
json_df.drop_duplicates(keep='first',inplace=True)

In [13]:
json_df=json_df.reset_index(drop=True)
json_df.head()

Unnamed: 0,Tweet,Stock,Date_T
0,$CELG nothing to be exited about,$CELG,2018-10-25
1,$AMD yall exhaust your buyer on first green ca...,$AMD,2018-07-13
2,$AMD day traders day.,$AMD,2018-09-25
3,$CBS https://tenor.com/wLB8.gif,$CBS,2018-07-27
4,$MU weak price action so far today. Don’t be a...,$MU,2018-07-31


### Cleaning Stock Ticker SMF

In [14]:
smf['ticker']=smf['ticker'].str.replace('$','',regex=False)
smf['ticker']=smf['ticker'].str.upper()

### Cleaning Stock Ticker JSON

In [15]:
json_df['Stock']=json_df['Stock'].str.replace('$','',regex=False)
json_df['Stock']=json_df['Stock'].str.upper()

In [16]:
smf.head()

Unnamed: 0,Id,date,ticker,SF1,SF2,SF3,SF4,SF5,SF6,SF7
0,270007,2018-07-21,INTC,-3.062194,1.223466,1.741714,2.279266,-1.323573,-0.274912,-4.504449
1,270008,2018-10-05,CTSH,0.816263,-2.184408,0.157975,-0.264743,-0.836282,0.046276,0.826353
2,270009,2018-10-01,CB,0.401281,0.091604,0.083411,-1.147041,-0.485223,-0.60106,1.012811
3,270010,2018-10-24,CTAS,-0.783521,1.192929,0.813831,-0.368166,-1.113656,-0.553581,-0.683803
4,270011,2018-07-27,INTC,0.796507,0.455341,0.679032,0.354336,-1.799055,0.126153,0.297111


In [17]:
json_df.head()

Unnamed: 0,Tweet,Stock,Date_T
0,$CELG nothing to be exited about,CELG,2018-10-25
1,$AMD yall exhaust your buyer on first green ca...,AMD,2018-07-13
2,$AMD day traders day.,AMD,2018-09-25
3,$CBS https://tenor.com/wLB8.gif,CBS,2018-07-27
4,$MU weak price action so far today. Don’t be a...,MU,2018-07-31


## Feature Engineering TP

### Feature Engineering For SMF

#### Adding Month

In [18]:
smf['Month']=smf['date'].dt.month

#### Adding Day

In [19]:
smf['Day']=smf['date'].dt.day

#### Adding WeekDay

In [20]:
smf['WeekDay']=smf['date'].dt.weekday

#### Adding WeekNumber

In [21]:
# ISO week number of each date
smf['WeekNumber']=[d.isocalendar()[1] for d in smf['date']]

### Feature Engineering For JSON

#### Adding Month

In [22]:
json_df['Month']=json_df['Date_T'].dt.month

#### Adding Day

In [23]:
json_df['Day']=json_df['Date_T'].dt.day

#### Adding WeekDay

In [24]:
json_df['WeekDay']=json_df['Date_T'].dt.weekday

#### Adding WeekNumber

In [25]:
# ISO week number of each date
json_df['WeekNumber']=[d.isocalendar()[1] for d in json_df['Date_T']]
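
The week number can also be computed without a Python-level loop. The sketch below assumes pandas 1.1 or newer, where the `.dt.isocalendar()` accessor is available, and uses three illustrative dates. Note that ISO weeks do not always follow the calendar year: 2018-12-31 already belongs to week 1 of ISO year 2019.

```python
import pandas as pd

# Illustrative dates; the first two also appear in the SMF output above.
dates = pd.Series(pd.to_datetime(['2018-07-21', '2018-10-05', '2018-12-31']))

# Element-wise version, equivalent to the loop in the notebook.
week_loop = [d.isocalendar()[1] for d in dates]

# Vectorised version (assumes pandas >= 1.1).
week_vec = dates.dt.isocalendar().week.astype(int).tolist()

print(week_loop, week_vec)
```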

In [26]:
smf.head()

Unnamed: 0,Id,date,ticker,SF1,SF2,SF3,SF4,SF5,SF6,SF7,Month,Day,WeekDay,WeekNumber
0,270007,2018-07-21,INTC,-3.062194,1.223466,1.741714,2.279266,-1.323573,-0.274912,-4.504449,7,21,5,29
1,270008,2018-10-05,CTSH,0.816263,-2.184408,0.157975,-0.264743,-0.836282,0.046276,0.826353,10,5,4,40
2,270009,2018-10-01,CB,0.401281,0.091604,0.083411,-1.147041,-0.485223,-0.60106,1.012811,10,1,0,40
3,270010,2018-10-24,CTAS,-0.783521,1.192929,0.813831,-0.368166,-1.113656,-0.553581,-0.683803,10,24,2,43
4,270011,2018-07-27,INTC,0.796507,0.455341,0.679032,0.354336,-1.799055,0.126153,0.297111,7,27,4,30


In [27]:
json_df.head()

Unnamed: 0,Tweet,Stock,Date_T,Month,Day,WeekDay,WeekNumber
0,$CELG nothing to be exited about,CELG,2018-10-25,10,25,3,43
1,$AMD yall exhaust your buyer on first green ca...,AMD,2018-07-13,7,13,4,28
2,$AMD day traders day.,AMD,2018-09-25,9,25,1,39
3,$CBS https://tenor.com/wLB8.gif,CBS,2018-07-27,7,27,4,30
4,$MU weak price action so far today. Don’t be a...,MU,2018-07-31,7,31,1,31


## Saving Cleaned SMF Data to CSV

In [28]:
smf.to_csv('smf_cleaned_test.csv',index=False)

## Cleaning JSON (Textual Data)

In [29]:
import spacy
import nltk
from nltk.corpus import wordnet as wn
from nltk.tokenize.toktok import ToktokTokenizer
import re
from bs4 import BeautifulSoup
import unicodedata

In [30]:
CONTRACTION_MAP = {
"ain't": "is not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he'll've": "he will have",
"he's": "he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how is",
"I'd": "I would",
"I'd've": "I would have",
"I'll": "I will",
"I'll've": "I will have",
"I'm": "I am",
"I've": "I have",
"i'd": "i would",
"i'd've": "i would have",
"i'll": "i will",
"i'll've": "i will have",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'd've": "it would have",
"it'll": "it will",
"it'll've": "it will have",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she would",
"she'd've": "she would have",
"she'll": "she will",
"she'll've": "she will have",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so as",
"that'd": "that would",
"that'd've": "that would have",
"that's": "that is",
"there'd": "there would",
"there'd've": "there would have",
"there's": "there is",
"they'd": "they would",
"they'd've": "they would have",
"they'll": "they will",
"they'll've": "they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what'll've": "what will have",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"when's": "when is",
"when've": "when have",
"where'd": "where did",
"where's": "where is",
"where've": "where have",
"who'll": "who will",
"who'll've": "who will have",
"who's": "who is",
"who've": "who have",
"why's": "why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you would",
"you'd've": "you would have",
"you'll": "you will",
"you'll've": "you will have",
"you're": "you are",
"you've": "you have"
}

In [31]:
# Assumes the small English model is installed; parser, tagger and NER are switched off
nlp = spacy.load('en_core_web_sm', disable=['parser', 'tagger', 'ner'])
tokenizer = ToktokTokenizer()
stopword_list = nltk.corpus.stopwords.words('english')
stopword_list.remove('no')
stopword_list.remove('not')

## Removing and Parsing HTML Tags

In [32]:
def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    stripped_text = soup.get_text()
    return stripped_text

In [33]:
#json_df['Tidy_Tweet'].to_csv('tidy_tweet_html.csv',index=False)
json_df['Tidy_Tweet']=json_df['Tweet'].apply(strip_html_tags)
json_df.head()

Unnamed: 0,Tweet,Stock,Date_T,Month,Day,WeekDay,WeekNumber,Tidy_Tweet
0,$CELG nothing to be exited about,CELG,2018-10-25,10,25,3,43,$CELG nothing to be exited about
1,$AMD yall exhaust your buyer on first green ca...,AMD,2018-07-13,7,13,4,28,$AMD yall exhaust your buyer on first green ca...
2,$AMD day traders day.,AMD,2018-09-25,9,25,1,39,$AMD day traders day.
3,$CBS https://tenor.com/wLB8.gif,CBS,2018-07-27,7,27,4,30,$CBS https://tenor.com/wLB8.gif
4,$MU weak price action so far today. Don’t be a...,MU,2018-07-31,7,31,1,31,$MU weak price action so far today. Don’t be a...


## Feature Engineering (Continued)

### Feature Engineering adding Punctuation Percentage

In [34]:
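import string
def count_punct(text):
    # Share of non-space characters that are punctuation, as a percentage
    count = sum(1 for char in text if char in string.punctuation)
    n_chars = len(text) - text.count(" ")
    return round(100*count/n_chars, 1) if n_chars else 0.0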

In [35]:
json_df['Punctuation_%'] = json_df['Tidy_Tweet'].apply(lambda x: count_punct(x))

### Feature Engineering adding Text Length

In [36]:
json_df['Text Length'] = json_df['Tidy_Tweet'].apply(lambda x: len(x) - x.count(" "))

In [37]:
json_df.head()

Unnamed: 0,Tweet,Stock,Date_T,Month,Day,WeekDay,WeekNumber,Tidy_Tweet,Punctuation_%,Text Length
0,$CELG nothing to be exited about,CELG,2018-10-25,10,25,3,43,$CELG nothing to be exited about,3.7,27
1,$AMD yall exhaust your buyer on first green ca...,AMD,2018-07-13,7,13,4,28,$AMD yall exhaust your buyer on first green ca...,9.3,54
2,$AMD day traders day.,AMD,2018-09-25,9,25,1,39,$AMD day traders day.,11.1,18
3,$CBS https://tenor.com/wLB8.gif,CBS,2018-07-27,7,27,4,30,$CBS https://tenor.com/wLB8.gif,23.3,30
4,$MU weak price action so far today. Don’t be a...,MU,2018-07-31,7,31,1,31,$MU weak price action so far today. Don’t be a...,3.4,88
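
The two features can be checked by hand on the third row. In `$AMD day traders day.` there are 18 non-space characters, 2 of which (`$` and `.`) are punctuation, giving 11.1%. A minimal sketch of the same calculation, with a guard for text that contains no non-space characters (the name `punct_pct` is hypothetical):

```python
import string

def punct_pct(text):
    """Percentage of non-space characters that are punctuation."""
    n_chars = len(text) - text.count(" ")
    if n_chars == 0:          # blank or whitespace-only text
        return 0.0
    n_punct = sum(1 for ch in text if ch in string.punctuation)
    return round(100 * n_punct / n_chars, 1)

print(punct_pct("$AMD day traders day."))   # 2 of 18 characters
print(punct_pct("   "))                     # guarded case
```

Both features are computed at this point in the pipeline, so they describe the tweet before URLs, cashtags and punctuation are stripped from `Tidy_Tweet`.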


## Cleaning Json Tweet (Continued)

### Removing Http and www from Tweet

In [38]:
# json_df['Tidy_Tweet'] = np.vectorize(remove_pattern)(json_df['Tidy_Tweet'], "http[\w]*") 
#json_df['Tidy_Tweet'].to_csv('tidy_tweet_htmlhttp.csv',index=False)
json_df['Tidy_Tweet'] = json_df['Tidy_Tweet'].str.replace(r'http\S*|www\.\S*','', case=False, regex=True)
json_df.head()

Unnamed: 0,Tweet,Stock,Date_T,Month,Day,WeekDay,WeekNumber,Tidy_Tweet,Punctuation_%,Text Length
0,$CELG nothing to be exited about,CELG,2018-10-25,10,25,3,43,$CELG nothing to be exited about,3.7,27
1,$AMD yall exhaust your buyer on first green ca...,AMD,2018-07-13,7,13,4,28,$AMD yall exhaust your buyer on first green ca...,9.3,54
2,$AMD day traders day.,AMD,2018-09-25,9,25,1,39,$AMD day traders day.,11.1,18
3,$CBS https://tenor.com/wLB8.gif,CBS,2018-07-27,7,27,4,30,$CBS,23.3,30
4,$MU weak price action so far today. Don’t be a...,MU,2018-07-31,7,31,1,31,$MU weak price action so far today. Don’t be a...,3.4,88
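
The URL pattern can be tried in isolation with the standard `re` module. This is only a sketch of the same idea: anything starting with `http` or `www.` is removed up to the next whitespace, case-insensitively. The helper name `strip_urls` is hypothetical.

```python
import re

# 'http' or 'www.' followed by any run of non-whitespace characters.
URL_RE = re.compile(r'http\S*|www\.\S*', flags=re.IGNORECASE)

def strip_urls(text):
    return URL_RE.sub('', text)

print(repr(strip_urls("$CBS https://tenor.com/wLB8.gif")))
print(repr(strip_urls("see WWW.example.com now")))
```

The surrounding whitespace is left in place, which is why row 3 still holds `$CBS` followed by a trailing space at this stage.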


### Removing Repeating Characters in Tweet

In [39]:
json_df['Tidy_Tweet'] = json_df['Tidy_Tweet'].apply(lambda x : re.sub(r'(.)\1{1,}', r'\1\1', x) )
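
The substitution above works on characters: any run of the same character is shortened to exactly two. A short sketch with made-up strings (the name `squeeze_repeats` is hypothetical) shows the effect, including that legitimate double letters are left alone:

```python
import re

def squeeze_repeats(text):
    # (.)\1{1,} matches a character followed by one or more copies of itself;
    # the whole run is replaced by two copies.
    return re.sub(r'(.)\1{1,}', r'\1\1', text)

print(squeeze_repeats("sooooo goooood!!!"))
print(squeeze_repeats("good call"))
```

Because the pattern uses `.`, runs of spaces and punctuation are shortened in the same way as letters.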

### Converting Emojis to Words

In [40]:
import emoji

In [41]:
#json_df['Tidy_Tweet'].to_csv('tidy_tweet_htmlhttpemo.csv',index=False)
json_df['Tidy_Tweet']=json_df['Tidy_Tweet'].apply(lambda x: emoji.demojize(x))

In [42]:
json_df['Tidy_Tweet']=json_df['Tidy_Tweet'].str.replace(":", "",regex=False)
json_df['Tidy_Tweet']=json_df['Tidy_Tweet'].str.replace("_", " ",regex=False)
json_df.head()

Unnamed: 0,Tweet,Stock,Date_T,Month,Day,WeekDay,WeekNumber,Tidy_Tweet,Punctuation_%,Text Length
0,$CELG nothing to be exited about,CELG,2018-10-25,10,25,3,43,$CELG nothing to be exited about,3.7,27
1,$AMD yall exhaust your buyer on first green ca...,AMD,2018-07-13,7,13,4,28,$AMD yall exhaust your buyer on first green ca...,9.3,54
2,$AMD day traders day.,AMD,2018-09-25,9,25,1,39,$AMD day traders day.,11.1,18
3,$CBS https://tenor.com/wLB8.gif,CBS,2018-07-27,7,27,4,30,$CBS,23.3,30
4,$MU weak price action so far today. Don’t be a...,MU,2018-07-31,7,31,1,31,$MU weak price action so far today. Don’t be a...,3.4,88


### Removing @ Twitter Handles from Tweet

In [43]:
#json_df['Tidy_Tweet'].to_csv('tidy_tweet_htmlhttpemo@.csv',index=False)
json_df['Tidy_Tweet'] =json_df['Tidy_Tweet'].str.replace(r"@[\w]*","",regex=True)
json_df.head()

Unnamed: 0,Tweet,Stock,Date_T,Month,Day,WeekDay,WeekNumber,Tidy_Tweet,Punctuation_%,Text Length
0,$CELG nothing to be exited about,CELG,2018-10-25,10,25,3,43,$CELG nothing to be exited about,3.7,27
1,$AMD yall exhaust your buyer on first green ca...,AMD,2018-07-13,7,13,4,28,$AMD yall exhaust your buyer on first green ca...,9.3,54
2,$AMD day traders day.,AMD,2018-09-25,9,25,1,39,$AMD day traders day.,11.1,18
3,$CBS https://tenor.com/wLB8.gif,CBS,2018-07-27,7,27,4,30,$CBS,23.3,30
4,$MU weak price action so far today. Don’t be a...,MU,2018-07-31,7,31,1,31,$MU weak price action so far today. Don’t be a...,3.4,88


### Removing $ Cashtags from Tweet

In [44]:
#json_df['Tidy_Tweet'].to_csv('tidy_tweet_htmlhttpemo@$.csv',index=False)
json_df['Tidy_Tweet']=json_df['Tidy_Tweet'].str.replace(r"\$[\w]*","",regex=True)
json_df.head()

Unnamed: 0,Tweet,Stock,Date_T,Month,Day,WeekDay,WeekNumber,Tidy_Tweet,Punctuation_%,Text Length
0,$CELG nothing to be exited about,CELG,2018-10-25,10,25,3,43,nothing to be exited about,3.7,27
1,$AMD yall exhaust your buyer on first green ca...,AMD,2018-07-13,7,13,4,28,yall exhaust your buyer on first green candle...,9.3,54
2,$AMD day traders day.,AMD,2018-09-25,9,25,1,39,day traders day.,11.1,18
3,$CBS https://tenor.com/wLB8.gif,CBS,2018-07-27,7,27,4,30,,23.3,30
4,$MU weak price action so far today. Don’t be a...,MU,2018-07-31,7,31,1,31,weak price action so far today. Don’t be afra...,3.4,88
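
The handle and cashtag patterns behave the same way: the marker character plus any word characters that follow it are deleted. A sketch with the standard `re` module (helper names are hypothetical) also shows a side effect worth knowing, namely that dollar amounts are partly removed as well:

```python
import re

def strip_handles(text):
    # '@' followed by any run of letters, digits or underscores.
    return re.sub(r'@\w*', '', text)

def strip_cashtags(text):
    # '$' followed by any run of letters, digits or underscores.
    return re.sub(r'\$\w*', '', text)

print(repr(strip_handles("@trader42 nice call")))
print(repr(strip_cashtags("$AMD day traders day.")))
print(repr(strip_cashtags("target $15.50")))   # the price is cut at the dot
```

Since the later cleaning step replaces every non-letter with a space anyway, the leftover digits do not survive, but the price information is lost either way.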


### Removing Accented Chars from Tweet

In [45]:
def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text

In [46]:
#json_df['Tidy_Tweet'].to_csv('tidy_tweet_htmlhttpemo@$accented.csv',index=False)
json_df['Tidy_Tweet']=json_df['Tidy_Tweet'].apply(remove_accented_chars)
json_df.head()

Unnamed: 0,Tweet,Stock,Date_T,Month,Day,WeekDay,WeekNumber,Tidy_Tweet,Punctuation_%,Text Length
0,$CELG nothing to be exited about,CELG,2018-10-25,10,25,3,43,nothing to be exited about,3.7,27
1,$AMD yall exhaust your buyer on first green ca...,AMD,2018-07-13,7,13,4,28,yall exhaust your buyer on first green candle...,9.3,54
2,$AMD day traders day.,AMD,2018-09-25,9,25,1,39,day traders day.,11.1,18
3,$CBS https://tenor.com/wLB8.gif,CBS,2018-07-27,7,27,4,30,,23.3,30
4,$MU weak price action so far today. Don’t be a...,MU,2018-07-31,7,31,1,31,weak price action so far today. Dont be afrai...,3.4,88
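
What the normalisation does can be seen on two small strings. NFKD splits an accented letter into a base letter plus a combining mark, and the ASCII round trip then drops the mark. Characters with no ASCII decomposition, such as the typographic apostrophe (U+2019) in row 4, are dropped entirely, which is why `Don’t` becomes `Dont`.

```python
import unicodedata

def remove_accented_chars(text):
    # Decompose, then keep only what survives encoding to ASCII.
    return (unicodedata.normalize('NFKD', text)
            .encode('ascii', 'ignore')
            .decode('utf-8', 'ignore'))

print(remove_accented_chars("café"))
print(remove_accented_chars("Don\u2019t be afraid"))
```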


### Lowercasing Text

In [47]:
#json_df['Tidy_Tweet'].to_csv('tidy_tweet_htmlhttpemo@$accentedlower.csv',index=False)
json_df['Tidy_Tweet']=json_df['Tidy_Tweet'].str.lower()
json_df.head()

Unnamed: 0,Tweet,Stock,Date_T,Month,Day,WeekDay,WeekNumber,Tidy_Tweet,Punctuation_%,Text Length
0,$CELG nothing to be exited about,CELG,2018-10-25,10,25,3,43,nothing to be exited about,3.7,27
1,$AMD yall exhaust your buyer on first green ca...,AMD,2018-07-13,7,13,4,28,yall exhaust your buyer on first green candle...,9.3,54
2,$AMD day traders day.,AMD,2018-09-25,9,25,1,39,day traders day.,11.1,18
3,$CBS https://tenor.com/wLB8.gif,CBS,2018-07-27,7,27,4,30,,23.3,30
4,$MU weak price action so far today. Don’t be a...,MU,2018-07-31,7,31,1,31,weak price action so far today. dont be afrai...,3.4,88


### Expanding Contractions

In [48]:
def expand_contractions(text, contraction_mapping=CONTRACTION_MAP):
    # Try longer keys first so that "can't've" is not cut short by "can't"
    keys = sorted(contraction_mapping, key=len, reverse=True)
    contractions_pattern = re.compile('({})'.format('|'.join(re.escape(k) for k in keys)), 
                                      flags=re.IGNORECASE|re.DOTALL)
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        # Look up the exact key first, then fall back to its lowercase form
        expanded_contraction = contraction_mapping.get(match)\
                                if contraction_mapping.get(match)\
                                else contraction_mapping.get(match.lower())                       
        # Keep the casing of the original first character
        expanded_contraction = first_char+expanded_contraction[1:]
        return expanded_contraction
        
    expanded_text = contractions_pattern.sub(expand_match, text)
    # Drop any apostrophes left over after expansion
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text

In [49]:
json_df['Tidy_Tweet']=json_df['Tidy_Tweet'].apply(expand_contractions)
json_df.head()

Unnamed: 0,Tweet,Stock,Date_T,Month,Day,WeekDay,WeekNumber,Tidy_Tweet,Punctuation_%,Text Length
0,$CELG nothing to be exited about,CELG,2018-10-25,10,25,3,43,nothing to be exited about,3.7,27
1,$AMD yall exhaust your buyer on first green ca...,AMD,2018-07-13,7,13,4,28,yall exhaust your buyer on first green candle...,9.3,54
2,$AMD day traders day.,AMD,2018-09-25,9,25,1,39,day traders day.,11.1,18
3,$CBS https://tenor.com/wLB8.gif,CBS,2018-07-27,7,27,4,30,,23.3,30
4,$MU weak price action so far today. Don’t be a...,MU,2018-07-31,7,31,1,31,weak price action so far today. dont be afrai...,3.4,88
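
Row 4 still reads `dont` after this step: the tweet used a typographic apostrophe, which was already dropped by the accent-removal step, so no key of the contraction map matches. One possible remedy, shown here only as a sketch, is to map the typographic apostrophe to a plain one before expanding. `SMALL_MAP` is a tiny illustrative subset of `CONTRACTION_MAP`, and `expand` is a hypothetical helper that assumes lowercase input.

```python
import re

# Tiny illustrative subset of the contraction map (lowercase input assumed).
SMALL_MAP = {"don't": "do not", "it's": "it is"}
PATTERN = re.compile("|".join(re.escape(k) for k in SMALL_MAP))

def expand(text):
    # Map the typographic apostrophe (U+2019) to a plain one first.
    text = text.replace("\u2019", "'")
    return PATTERN.sub(lambda m: SMALL_MAP[m.group(0)], text)

print(expand("don\u2019t be afraid"))
print(expand("it's weak"))
```

In the notebook's pipeline this replacement would have to run before the accented-character step to have any effect.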


### Removing Punctuation, Numbers, Special Characters and Unnecessary Hashtags

In [50]:
#json_df['Tidy_Tweet'].to_csv('tidy_tweet_htmlhttpemo@$accentedlowercontractspecial.csv',index=False)
json_df['Tidy_Tweet'] = json_df['Tidy_Tweet'].str.replace(r"[^a-zA-Z#]", " ", regex=True)

In [51]:
# Drop '#' characters that are not directly attached to a following word
json_df['Tidy_Tweet'] = json_df['Tidy_Tweet'].str.replace(r"[#+]?\B", "", regex=True)
json_df.head()

Unnamed: 0,Tweet,Stock,Date_T,Month,Day,WeekDay,WeekNumber,Tidy_Tweet,Punctuation_%,Text Length
0,$CELG nothing to be exited about,CELG,2018-10-25,10,25,3,43,nothing to be exited about,3.7,27
1,$AMD yall exhaust your buyer on first green ca...,AMD,2018-07-13,7,13,4,28,yall exhaust your buyer on first green candle...,9.3,54
2,$AMD day traders day.,AMD,2018-09-25,9,25,1,39,day traders day,11.1,18
3,$CBS https://tenor.com/wLB8.gif,CBS,2018-07-27,7,27,4,30,,23.3,30
4,$MU weak price action so far today. Don’t be a...,MU,2018-07-31,7,31,1,31,weak price action so far today dont be afrai...,3.4,88
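
The two substitutions can be followed step by step on a made-up string. The first turns everything except ASCII letters and `#` into a space; the second removes a `#` only when no letter follows it directly, so a hashtag such as `#amd` keeps its marker. The helper name `strip_non_letters` is hypothetical.

```python
import re

def strip_non_letters(text):
    # Step 1: everything except ASCII letters and '#' becomes a space.
    text = re.sub(r"[^a-zA-Z#]", " ", text)
    # Step 2: drop '#' characters that are not directly followed by a letter.
    text = re.sub(r"[#+]?\B", "", text)
    return text

cleaned = strip_non_letters("day traders day. # 100% #amd")
# Collapse the runs of spaces only for display.
normalised = " ".join(cleaned.split())
print(normalised)
```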


### Removing Stopwords

In [52]:
def remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    filtered_tokens = [token for token in tokens if token not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text

In [53]:
#json_df['Tidy_Tweet'].to_csv('tidy_tweet_htmlhttpemo@$accentedlowercontractspecialstp.csv',index=False)
json_df['Tidy_Tweet']=json_df['Tidy_Tweet'].apply(remove_stopwords)
json_df.head()

Unnamed: 0,Tweet,Stock,Date_T,Month,Day,WeekDay,WeekNumber,Tidy_Tweet,Punctuation_%,Text Length
0,$CELG nothing to be exited about,CELG,2018-10-25,10,25,3,43,nothing exited,3.7,27
1,$AMD yall exhaust your buyer on first green ca...,AMD,2018-07-13,7,13,4,28,yall exhaust buyer first green candle byee,9.3,54
2,$AMD day traders day.,AMD,2018-09-25,9,25,1,39,day traders day,11.1,18
3,$CBS https://tenor.com/wLB8.gif,CBS,2018-07-27,7,27,4,30,,23.3,30
4,$MU weak price action so far today. Don’t be a...,MU,2018-07-31,7,31,1,31,weak price action far today dont afraid go sho...,3.4,88
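
The reason for taking `no` and `not` out of the stopword list is that negations carry sentiment. A minimal sketch with a made-up miniature stopword set (the notebook itself uses NLTK's English list, and `drop_stopwords` is a hypothetical name) illustrates the effect:

```python
# Hypothetical miniature stopword list; the notebook uses NLTK's English list.
STOPWORDS = {"to", "be", "about", "is", "the", "no", "not"}
KEEP = {"no", "not"}  # negations are kept because they flip the sentiment

def drop_stopwords(text, stopwords=frozenset(STOPWORDS - KEEP)):
    return " ".join(tok for tok in text.split() if tok not in stopwords)

print(drop_stopwords("nothing to be exited about"))
print(drop_stopwords("the trend is not good"))
```

With `not` filtered out, the second sentence would be reduced to `trend good`, the opposite of what the author meant.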


### Stemming Text

In [54]:
# from nltk import WordNetLemmatizer
# lemmatizer = WordNetLemmatizer()
# tokenized_tweet = tokenized_tweet.apply(lambda x: [lemmatizer.lemmatize(i) for i in x])
# json_df['Tidy_Tweet'].to_csv('tidy_tweet_htmlhttpemo@$accentedlowercontractspecialstplemma.csv',index=False)
# json_df.iloc[121411:121414]
tokenized_tweet = json_df['Tidy_Tweet'].apply(lambda x: x.split()) 
tokenized_tweet.head()

0                                    [nothing, exited]
1    [yall, exhaust, buyer, first, green, candle, b...
2                                  [day, traders, day]
3                                                   []
4    [weak, price, action, far, today, dont, afraid...
Name: Tidy_Tweet, dtype: object

In [55]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
tokenized_tweet = tokenized_tweet.apply(lambda x: [stemmer.stem(i) for i in x])
tokenized_tweet.head()

0                                         [noth, exit]
1    [yall, exhaust, buyer, first, green, candl, byee]
2                                   [day, trader, day]
3                                                   []
4    [weak, price, action, far, today, dont, afraid...
Name: Tidy_Tweet, dtype: object

In [56]:
# Join each token list back into a single space-separated string
json_df['Tidy_Tweet'] = tokenized_tweet.apply(' '.join)
json_df.head()

Unnamed: 0,Tweet,Stock,Date_T,Month,Day,WeekDay,WeekNumber,Tidy_Tweet,Punctuation_%,Text Length
0,$CELG nothing to be exited about,CELG,2018-10-25,10,25,3,43,noth exit,3.7,27
1,$AMD yall exhaust your buyer on first green ca...,AMD,2018-07-13,7,13,4,28,yall exhaust buyer first green candl byee,9.3,54
2,$AMD day traders day.,AMD,2018-09-25,9,25,1,39,day trader day,11.1,18
3,$CBS https://tenor.com/wLB8.gif,CBS,2018-07-27,7,27,4,30,,23.3,30
4,$MU weak price action so far today. Don’t be a...,MU,2018-07-31,7,31,1,31,weak price action far today dont afraid go sho...,3.4,88


### Checking Null Rows and Removing Blank Tweets

In [57]:
# Copy so the checkpoint is independent of later changes to json_df
temp_checkpoint = json_df.copy()
temp_checkpoint.head()

Unnamed: 0,Tweet,Stock,Date_T,Month,Day,WeekDay,WeekNumber,Tidy_Tweet,Punctuation_%,Text Length
0,$CELG nothing to be exited about,CELG,2018-10-25,10,25,3,43,noth exit,3.7,27
1,$AMD yall exhaust your buyer on first green ca...,AMD,2018-07-13,7,13,4,28,yall exhaust buyer first green candl byee,9.3,54
2,$AMD day traders day.,AMD,2018-09-25,9,25,1,39,day trader day,11.1,18
3,$CBS https://tenor.com/wLB8.gif,CBS,2018-07-27,7,27,4,30,,23.3,30
4,$MU weak price action so far today. Don’t be a...,MU,2018-07-31,7,31,1,31,weak price action far today dont afraid go sho...,3.4,88


In [58]:
test_tidy=temp_checkpoint[temp_checkpoint['Tidy_Tweet']!=""]
test_tidy.isnull().sum()

Tweet            0
Stock            0
Date_T           0
Month            0
Day              0
WeekDay          0
WeekNumber       0
Tidy_Tweet       0
Punctuation_%    0
Text Length      0
dtype: int64

In [59]:
test_tidy[test_tidy['Tidy_Tweet']==""]

Unnamed: 0,Tweet,Stock,Date_T,Month,Day,WeekDay,WeekNumber,Tidy_Tweet,Punctuation_%,Text Length


In [60]:
test_tidy.isnull().sum()

Tweet            0
Stock            0
Date_T           0
Month            0
Day              0
WeekDay          0
WeekNumber       0
Tidy_Tweet       0
Punctuation_%    0
Text Length      0
dtype: int64

In [61]:
json_df=test_tidy

In [62]:
json_df.head()

Unnamed: 0,Tweet,Stock,Date_T,Month,Day,WeekDay,WeekNumber,Tidy_Tweet,Punctuation_%,Text Length
0,$CELG nothing to be exited about,CELG,2018-10-25,10,25,3,43,noth exit,3.7,27
1,$AMD yall exhaust your buyer on first green ca...,AMD,2018-07-13,7,13,4,28,yall exhaust buyer first green candl byee,9.3,54
2,$AMD day traders day.,AMD,2018-09-25,9,25,1,39,day trader day,11.1,18
4,$MU weak price action so far today. Don’t be a...,MU,2018-07-31,7,31,1,31,weak price action far today dont afraid go sho...,3.4,88
5,"$AMZN continues to grow, specifically in key a...",AMZN,2018-08-04,8,4,5,31,continu grow specif key area like cloud comput...,3.7,108


In [63]:
json_df.shape

(254741, 10)
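
The filtering above can be condensed into a few lines. The sketch below uses a made-up two-row frame and is one possible variant, not the notebook's exact steps: it also treats whitespace-only strings as blank and takes an explicit copy, so that later column drops act on an independent frame.

```python
import pandas as pd

# Illustrative frame: the first tweet is empty after cleaning.
df = pd.DataFrame({
    'Tweet': ['$CBS https://tenor.com/wLB8.gif', '$AMD day traders day.'],
    'Tidy_Tweet': ['', 'day trader day'],
})

# Keep rows whose cleaned text is not blank; .copy() detaches the result
# from df so that later edits do not touch a slice of the original.
non_blank = df[df['Tidy_Tweet'].str.strip() != ''].copy()
non_blank = non_blank.drop(columns='Tweet').reset_index(drop=True)
print(non_blank)
```

Blank strings are worth removing before saving, because an empty field written to CSV is read back by `pd.read_csv` as `NaN` by default.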

## Saving Cleaned JSON (Textual Data)

In [64]:
json_df=json_df.drop('Tweet',axis=1)

In [65]:
json_df.to_csv('json_df_cleaned_test.csv',index=False)

## Fin