# Data Description

### Company.csv
Just maps Ticker Symbol to company name. Not that useful

### CompanyValues
Holds market values for companies. ticker_symbol, day_ate, and close_value are the important columns

### Company_Tweet
Maps tweet ids to the company they're about

### Tweet
Holds our tweet data. Relevant columns are tweet_id, post_date, body, (like_num/retweet_num could be used to scale based on reach maybe, but for now lets ignore)

# Imports

In [None]:
import pandas as pd

# Load and clean data

In [2]:
company_values_file_path = "./data/CompanyValues.csv"
tweet_file_path = "./data/Tweet.csv"
company_tweet_file_path = "./data/Company_Tweet.csv"

company_values_df = pd.read_csv(company_values_file_path)
tweet_df = pd.read_csv(tweet_file_path)
company_tweet_df = pd.read_csv(company_tweet_file_path)

### Value Data

In [3]:
company_values_df.drop(['volume', 'open_value', 'high_value', "low_value"], axis=1, inplace=True)
company_values_df.head()

Unnamed: 0,ticker_symbol,day_date,close_value
0,AAPL,2020-05-29,317.94
1,AAPL,2020-05-28,318.25
2,AAPL,2020-05-27,318.11
3,AAPL,2020-05-26,316.73
4,AAPL,2020-05-22,318.89


### Tweet Data

In [4]:
print(f"Before filtering, we had {len(tweet_df)} tweets")
tweet_df = tweet_df[tweet_df['like_num'] > 5]
print(len(f"After we have {tweet_df}"))
tweet_df.head()

Before filtering, we had 3717964 tweets
2058


Unnamed: 0,tweet_id,writer,post_date,body,comment_num,retweet_num,like_num
25,550453624258965505,WSJ,1420073345,Jeff Bezos lost $7.4 billion in Amazon's worst...,21,139,57
92,550489146624856064,CNBC,1420081814,"Earlier this month, a mysterious glitch caused...",4,18,17
105,550499176422051840,WSJ,1420084206,Jeff Bezos lost $7.4 billion in Amazon's worst...,17,113,57
315,550679033395306497,seeitmarket,1420127087,"New Post - ""Apple Stock Pullback: Price Target...",0,3,7
343,550690489008394241,IBDinvestors,1420129818,2015 technology forecasts: Wearable technology...,0,8,11


In [13]:
# Now we add the ticker symbol to each row, and drop unneeded columns
merged_tweet_df = tweet_df.merge(company_tweet_df, on='tweet_id', how='left')
merged_tweet_df.drop(['writer', 'comment_num', 'retweet_num', "tweet_id"], axis=1, inplace=True)
merged_tweet_df.head()

Unnamed: 0,post_date,body,like_num,ticker_symbol
0,1420073345,Jeff Bezos lost $7.4 billion in Amazon's worst...,57,AMZN
1,1420081814,"Earlier this month, a mysterious glitch caused...",17,AAPL
2,1420084206,Jeff Bezos lost $7.4 billion in Amazon's worst...,57,AMZN
3,1420127087,"New Post - ""Apple Stock Pullback: Price Target...",7,AAPL
4,1420129818,2015 technology forecasts: Wearable technology...,11,AAPL


In [14]:
# Change the post date column to be in YYYY-MM-DD
merged_tweet_df['post_date'] = pd.to_datetime(merged_tweet_df['post_date'], unit='s').dt.strftime('%Y-%m-%d')
merged_tweet_df.head()

Unnamed: 0,post_date,body,like_num,ticker_symbol
0,2015-01-01,Jeff Bezos lost $7.4 billion in Amazon's worst...,57,AMZN
1,2015-01-01,"Earlier this month, a mysterious glitch caused...",17,AAPL
2,2015-01-01,Jeff Bezos lost $7.4 billion in Amazon's worst...,57,AMZN
3,2015-01-01,"New Post - ""Apple Stock Pullback: Price Target...",7,AAPL
4,2015-01-01,2015 technology forecasts: Wearable technology...,11,AAPL


### Save this filtered data to CSV

In [16]:
merged_tweet_df.to_csv("./data/filtered_tweet_data.csv")
company_values_df.to_csv("./data/filtered_company_values.csv")