
#1. Prep (Sentiment140)

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import seaborn as sn
import matplotlib.pyplot as plt

In [2]:
!mkdir -p data
!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/investigating-sentiment-analysis/data/training.1600000.processed.noemoticon.csv.zip -P data
!unzip -n -d data data/training.1600000.processed.noemoticon.csv.zip

--2022-12-10 17:29:47--  https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/investigating-sentiment-analysis/data/training.1600000.processed.noemoticon.csv.zip
Resolving nyc3.digitaloceanspaces.com (nyc3.digitaloceanspaces.com)... 162.243.189.2
Connecting to nyc3.digitaloceanspaces.com (nyc3.digitaloceanspaces.com)|162.243.189.2|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 85088192 (81M) [application/zip]
Saving to: ‘data/training.1600000.processed.noemoticon.csv.zip’


2022-12-10 17:29:49 (70.8 MB/s) - ‘data/training.1600000.processed.noemoticon.csv.zip’ saved [85088192/85088192]

Archive:  data/training.1600000.processed.noemoticon.csv.zip
  inflating: data/training.1600000.processed.noemoticon.csv  


In [3]:
#Read the tweets
df = pd.read_csv("data/training.1600000.processed.noemoticon.csv",
                names=['polarity', 'id', 'date', 'query', 'user', 'text'],
                encoding='latin-1')
df.head()

Unnamed: 0,polarity,id,date,query,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [4]:
df.polarity = df.polarity.replace({0: 0, 4: 1})
df.polarity.value_counts()

0    800000
1    800000
Name: polarity, dtype: int64

In [5]:
df = df.drop(columns=['id', 'date', 'query', 'user'])
#df.head(100)

In [6]:
df = df.sample(n=100000)
df.polarity.value_counts()

1    50028
0    49972
Name: polarity, dtype: int64

In [8]:
vectorizer = TfidfVectorizer(max_features=2000)
vectors = vectorizer.fit_transform(df.text)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
words_df = pd.DataFrame(vectors.toarray(), columns=vectorizer.get_feature_names_out())
#words_df.head()
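Calling `.toarray()` on 100,000 rows by 2,000 features produces a dense float64 matrix of roughly 1.6 GB. `LinearSVC` also accepts the sparse matrix that `TfidfVectorizer` returns, so the dense step is optional. The sketch below uses toy data and hypothetical `toy_*` names to show that both input forms give the same predictions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for df.text / df.polarity.
toy_texts = ["i love this", "so happy today", "i hate this", "so sad today"] * 10
toy_labels = [1, 1, 0, 0] * 10

toy_vectorizer = TfidfVectorizer(max_features=2000)
X_sparse = toy_vectorizer.fit_transform(toy_texts)  # SciPy CSR matrix, mostly zeros

# Fit directly on the sparse matrix; no DataFrame or dense array is needed.
toy_svc = LinearSVC().fit(X_sparse, toy_labels)

# Predictions are the same whether the input is sparse or dense.
sparse_preds = toy_svc.predict(X_sparse)
dense_preds = toy_svc.predict(X_sparse.toarray())
print((sparse_preds == dense_preds).all())
```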



In [9]:
X = words_df
y = df.polarity

In [10]:
%%time
# Create and train a linear support vector classifier (LinearSVC)
svc = LinearSVC()
svc.fit(X, y)

CPU times: user 2.61 s, sys: 9.89 ms, total: 2.62 s
Wall time: 2.62 s


LinearSVC()
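The model above is fitted on all sampled rows, so nothing here says how well it generalizes. The first cell already imports `train_test_split` and `confusion_matrix`, so a hold-out check could look like the sketch below. It runs on toy data; `evaluate_holdout`, the split size, and the seed are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def evaluate_holdout(texts, labels, test_size=0.2, seed=0):
    """Fit TF-IDF + LinearSVC on a training split and score the held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=seed, stratify=labels
    )
    # The pipeline fits the vocabulary on the training split only.
    model = make_pipeline(TfidfVectorizer(max_features=2000), LinearSVC())
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return accuracy_score(y_test, y_pred), confusion_matrix(y_test, y_pred, labels=[0, 1])


# Hypothetical toy data; in the notebook this would be df.text and df.polarity.
toy_texts = ["i love this", "so happy today", "great day", "love it so much",
             "i hate this", "so sad today", "awful day", "hate it so much"] * 5
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0] * 5

accuracy, cm = evaluate_holdout(toy_texts, toy_labels)
print(f"hold-out accuracy: {accuracy:.2f}")
print(cm)  # rows: true class 0/1, columns: predicted class 0/1
```

The confusion matrix could then be passed to `sn.heatmap(cm, annot=True, fmt="d")`, which is presumably what the `seaborn` import was intended for.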

#2. Sentiment Analysis

##2.1 Sentiment of text

In [11]:
# Start by using the model to analyze the sentiment of specific text

pd.set_option("display.max_colwidth", 200)

text = pd.DataFrame({'content': [
    "I love ATIT",
    "I dont like apples!",
    "BASF just released their new product!",
    "Trashy television shows are some of my favorites",
    "I'm not sure how I feel about Christian Klein",
]})
text

Unnamed: 0,content
0,I love ATIT
1,I dont like apples!
2,BASF just released their new product!
3,Trashy television shows are some of my favorites
4,I'm not sure how I feel about Christian Klein


In [12]:
# Put the text through the vectorizer
# Use transform, not fit_transform, because the vocabulary was already learned from the training tweets
text_vectors = vectorizer.transform(text.content)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
text_words_df = pd.DataFrame(text_vectors.toarray(), columns=vectorizer.get_feature_names_out())
text_words_df.head()



Unnamed: 0,00,000,09,10,100,11,12,13,14,15,...,your,youre,yours,yourself,youtube,yr,yrs,yum,yummy,yup
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [13]:
# SVC predictions
text['pred_svc'] = svc.predict(text_words_df)

In [14]:
text

Unnamed: 0,content,pred_svc
0,I love ATIT,1
1,I dont like apples!,0
2,BASF just released their new product!,1
3,Trashy television shows are some of my favorites,1
4,I'm not sure how I feel about Christian Klein,0
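`predict` returns only a hard 0/1 label, which hides how close a sentence is to the decision boundary (the "I'm not sure how I feel" example is a plausible borderline case). `LinearSVC.decision_function` returns a signed score instead: positive means class 1, negative means class 0, and values near zero are low-confidence. The sketch below uses a toy model with hypothetical `toy_*` names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy training set standing in for the Sentiment140 sample.
toy_train = ["i love this", "love it so much", "what a great day",
             "i hate this", "hate it so much", "what an awful day"]
toy_labels = [1, 1, 1, 0, 0, 0]

toy_vectorizer = TfidfVectorizer()
toy_svc = LinearSVC().fit(toy_vectorizer.fit_transform(toy_train), toy_labels)

samples = ["love love love", "what a day", "hate this"]
sample_vectors = toy_vectorizer.transform(samples)

margins = toy_svc.decision_function(sample_vectors)  # signed score per sample
labels = toy_svc.predict(sample_vectors)             # 1 where the margin is > 0

for sentence, label, margin in zip(samples, labels, margins):
    print(f"{sentence!r}: label={label}, margin={margin:+.3f}")
```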


##2.2 Sentiment of Tweets

In [15]:
tweets = pd.read_csv('https://raw.githubusercontent.com/IvanWasNotAvailable/StockPricePrediction/main/tweets.csv',names=['text'], skiprows=1)

In [16]:
tweets.head()

Unnamed: 0,text
0,"Sealing Coating Market Outlook by 2029 | BASF, Alumasc Exterior Building Products, BB Fabrication Renaulac, Koster. – PRIZM News - PRIZM News https://t.co/hDuLGktAvn"
1,BASF . Owned .
2,BASF launches first biomass balance automotive coatings in China - Just Auto\n\nhttps://t.co/OghWZwWHdX\n\n#CarbonCredit #CarbonNeutral #NetZero #CarbonFarming #CarbonCapture #CarbonMarket #Clean...
3,NewswireToday / BASF Launches First Biomass Balance Automotive Coatings in China #BASF #Coatings #Biomass #ColorBrite #AirspaceBlue #ReSource #Basecoat #OEMs #Automotive #Trucking #RV - https://t....
4,GPCA 2022: BASF chairman says Green Deal likely to fall short of US IRA in spurring clean energy... https://t.co/YHYtgof1eW https://t.co/0O3ZxtwkTM


In [17]:
tweets.shape

(100, 1)

First we need to **vectorize** our sentences, that is, turn them into numbers the algorithm can work with.

In [18]:
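# Compute TF-IDF features with transform, not fit_transform, because the vocabulary was already learned from the training tweets
tweets_vectors = vectorizer.transform(tweets.text)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tweets_words_df = pd.DataFrame(tweets_vectors.toarray(), columns=vectorizer.get_feature_names_out())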
tweets_words_df.head()



Unnamed: 0,00,000,09,10,100,11,12,13,14,15,...,your,youre,yours,yourself,youtube,yr,yrs,yum,yummy,yup
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [19]:
tweets_words_df.shape

(100, 2000)

In [20]:
# SVC predictions
tweets['polarity'] = svc.predict(tweets_words_df)

In [21]:
tweets

Unnamed: 0,text,polarity
0,"Sealing Coating Market Outlook by 2029 | BASF, Alumasc Exterior Building Products, BB Fabrication Renaulac, Koster. – PRIZM News - PRIZM News https://t.co/hDuLGktAvn",0
1,BASF . Owned .,1
2,BASF launches first biomass balance automotive coatings in China - Just Auto\n\nhttps://t.co/OghWZwWHdX\n\n#CarbonCredit #CarbonNeutral #NetZero #CarbonFarming #CarbonCapture #CarbonMarket #Clean...,1
3,NewswireToday / BASF Launches First Biomass Balance Automotive Coatings in China #BASF #Coatings #Biomass #ColorBrite #AirspaceBlue #ReSource #Basecoat #OEMs #Automotive #Trucking #RV - https://t....,1
4,GPCA 2022: BASF chairman says Green Deal likely to fall short of US IRA in spurring clean energy... https://t.co/YHYtgof1eW https://t.co/0O3ZxtwkTM,1
...,...,...
95,"Today, on International Civil Aviation Day, we want to share more about our sharkskin technology, developed jointly with Lufthansa Technik, which helps reduce carbon emissions. Learn more about ho...",1
96,"An @LSUEngineering machine learning approach to understand, organize industrial production data allows engineers to discover surprising patterns, leading to sustained research partnerships with @e...",1
97,BASF is leveraging over 100 years of experience in precious metals recycling to advance best-in-class battery recycling solutions all while practicing sustainable raw materials sourcing. https://...,1
98,"“High natural gas prices have created a situation where importing ammonia from overseas was cheaper than manufacturing it ourselves""\n\nBASF announced they would downsize in Europe “as quickly...",0


In [22]:
# Count the number of positive and negative tweets
positive = (tweets['polarity'] == 1).sum()
print('Positive Tweets: ', positive)
negative = (tweets['polarity'] == 0).sum()
print('Negative Tweets: ', negative)

Positive Tweets:  81
Negative Tweets:  19


In [23]:
# A simple way to read the sentiment as a stock price signal:
# more negative than positive tweets suggests a fall, anything else a rise.
if positive < negative:
  print('Stockprice should go down 📉')
else:
  print('Stockprice should go up 📈')

Stockprice should go up 📈
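The rule above treats a 51/49 split the same as 81/19 and calls a tie "up". One possible refinement, sketched below, is to look at the share of positive tweets and report no signal when it is close to 50%. The `sentiment_signal` function and its `band` parameter are hypothetical and not part of the original notebook; the width of the neutral zone is an arbitrary choice.

```python
def sentiment_signal(polarities, band=0.1):
    """Return (signal, positive_share) for a sequence of 0/1 polarity labels.

    band is a hypothetical neutral zone around 0.5: shares within
    0.5 +/- band are reported as "neutral" instead of "up" or "down".
    """
    polarities = list(polarities)
    if not polarities:
        raise ValueError("need at least one prediction")
    share = sum(1 for p in polarities if p == 1) / len(polarities)
    if share > 0.5 + band:
        return "up", share
    if share < 0.5 - band:
        return "down", share
    return "neutral", share


# Counts from the run above: 81 positive and 19 negative tweets.
signal, share = sentiment_signal([1] * 81 + [0] * 19)
print(signal, share)
```

In the notebook, the call would be `sentiment_signal(tweets['polarity'])`.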
