## Model Development

### Approach
- Encode categorical columns
- Normalize and standardize numerical columns
- Split data into training and testing sets
- Train and test a range of different models directly
- Select the top 3 models based on model performance
- Tune hyperparameters of the selected models
- Apply ensemble methods combining the tuned top 3 models

In [28]:
import numpy as np
import pandas as pd

In [29]:
df = pd.read_csv('final_reddit_data.csv')

In [30]:
df.head(2)

Unnamed: 0.1,Unnamed: 0,Date,Volume,company,Price Movement,subreddit,text,score,num_comments,comments,Sentiment,mention_count
0,0,2015-09-24,543502,BMWYY,Up,investing,concern spread bmw http www bloomberg com news...,224,149,well honestly car manufacturer involved practi...,Positive,35
1,1,2016-08-24,28760,BMWYY,Down,automotive,tesla model used carbon fiber plastic frame si...,2,8,30 decrease weight result increase efficiency ...,Positive,35


Let's check for missing values and then encode the non-numerical columns so that we can use them in ML models.

In [31]:
df.isnull().sum()

Unnamed: 0.1,0
Unnamed: 0,0
Date,0
Volume,0
company,0
Price Movement,0
subreddit,0
text,0
score,0
num_comments,0
comments,0


In [32]:
from sklearn.preprocessing import LabelEncoder
le_company = LabelEncoder() # for encoding company column
le_subreddit = LabelEncoder() # for encoding subreddit column

df['company'] = le_company.fit_transform(df['company'])
df['subreddit'] = le_subreddit.fit_transform(df['subreddit'])

sentiment_mapping = {'Positive':2, 'Neutral':1, 'Negative':0} # for encoding Sentiment column

df['Sentiment'] = df['Sentiment'].map(sentiment_mapping)
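
As a quick sanity check on the encoding step, here is a minimal sketch on a toy frame (the rows are made up for illustration and are not taken from `final_reddit_data.csv`). It shows that `LabelEncoder` assigns codes in sorted label order, and that `Series.map` leaves `NaN` for any label missing from the mapping, which is worth checking before training.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data for illustration only; not rows from the real dataset.
toy = pd.DataFrame({
    'company': ['VWAGY', 'BMWYY', 'VWAGY'],
    'Sentiment': ['Positive', 'Negative', 'Mixed'],  # 'Mixed' is not in the mapping
})

le = LabelEncoder()
toy['company_code'] = le.fit_transform(toy['company'])
# classes_ holds the original labels; position i is the label for code i
code_to_company = dict(enumerate(le.classes_))

sentiment_mapping = {'Positive': 2, 'Neutral': 1, 'Negative': 0}
toy['sentiment_code'] = toy['Sentiment'].map(sentiment_mapping)
# Count labels that the mapping did not cover (they become NaN)
unmapped = int(toy['sentiment_code'].isna().sum())
```

Keeping `code_to_company` around makes it easy to translate the integer codes back into ticker names later.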


Scaling numerical columns

In [33]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
numerical_columns = df[['Volume', 'score', 'num_comments', 'mention_count']]
numerical_columns = scaler.fit_transform(numerical_columns)
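
One possible variation, sketched here on synthetic numbers (the array below only stands in for columns such as Volume or score), is to fit the scaler on the training rows only and reuse its statistics on the test rows, so that nothing from the test set influences the scaling. Note also that standardized values can be negative, which `MultinomialNB` does not accept.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the four numerical columns.
X_num = rng.integers(0, 1000, size=(100, 4)).astype(float)

X_tr, X_te = train_test_split(X_num, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean/std from the training rows only
X_te_scaled = scaler.transform(X_te)      # apply the same statistics to the test rows
```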

- We will extract features from our textual columns (text, comments)
- We will use a TF-IDF vectorizer
- Before that, we will merge the text and comments columns

In [34]:
df

Unnamed: 0.1,Unnamed: 0,Date,Volume,company,Price Movement,subreddit,text,score,num_comments,comments,Sentiment,mention_count
0,0,2015-09-24,543502,0,Up,2,concern spread bmw http www bloomberg com news...,224,149,well honestly car manufacturer involved practi...,2,35
1,1,2016-08-24,28760,0,Down,0,tesla model used carbon fiber plastic frame si...,2,8,30 decrease weight result increase efficiency ...,2,35
2,2,2017-07-28,37846,0,Up,3,best way buy share foreign company thinking pu...,1,3,get bmw bmwyy american market broker checked l...,2,35
3,3,2018-10-26,101185,0,Up,0,fix bmw,1,1,3m headlight restoration kit,1,35
4,4,2019-03-15,101248,0,Up,2,investing bmw bmwyy v bmw de live u want buy s...,8,3,de ticker german exchange traded germanybucks ...,2,35
...,...,...,...,...,...,...,...,...,...,...,...,...
274,274,2024-07-12,345700,7,Up,4,rivn 70 q 50 past 2 day rivn went since volksw...,24,32,user report total submission 1 first seen wsb ...,1,33
275,275,2024-09-06,652200,7,Down,4,volkswagen vwagy may financial trouble help ri...,7,29,user report total submission 1 first seen wsb ...,2,33
276,276,2024-10-28,402500,7,Up,1,volkswagen plan close least three manufacturin...,690,270,germany getting hammered ukraine russia war lo...,2,33
277,277,2024-10-30,398600,7,Up,1,c tested 2025 volkswagen id buzz bee knee,354,286,234 mile range 70k great kia ev9 get 270 304 m...,2,33


In [35]:
df['text'] = df['text'].fillna('') + ' ' + df['comments'].fillna('')

In [36]:
df.drop(['comments'], axis=1, inplace=True)

In [37]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X_text = vectorizer.fit_transform(df['text'])

X_text_df = pd.DataFrame(X_text.toarray(), columns=vectorizer.get_feature_names_out())

Merging encoded columns, numerical columns and tf-idf dataframe columns

In [38]:
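# Note: this re-selects the raw columns from df, so the scaled array
# from In [33] is overwritten and X ends up with unscaled values
numerical_columns = df[['Volume', 'score', 'num_comments', 'mention_count']]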

X = pd.concat([numerical_columns.reset_index(drop=True),
               X_text_df.reset_index(drop=True),
               df[['company', 'subreddit', 'Sentiment']].reset_index(drop=True)], axis=1)

In [39]:
y = df['Price Movement'].map({'Up':2, 'Neutral':1, 'Down':0}) # encoding target variables

Splitting data into training and testing sets

In [40]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train = X_train.to_numpy()
X_test = X_test.to_numpy()

# Convert the targets to NumPy arrays as well
y_train = y_train.to_numpy()
y_test = y_test.to_numpy()
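
With a small dataset, a plain random split can leave the test set with a different Up/Down ratio than the training set. A minimal sketch of a stratified split, on synthetic labels with an assumed 60/40 class balance (not the real distribution):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = np.array([2] * 60 + [0] * 40)  # synthetic labels: 60 "Up" (2), 40 "Down" (0)

# stratify=y keeps the class proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
test_up_share = (y_test == 2).mean()
```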

We will initially train and test the following models:

In [41]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier

In [42]:
svc = SVC(kernel='sigmoid', gamma=1.0)
knc = KNeighborsClassifier()
mnb = MultinomialNB()
dtc = DecisionTreeClassifier(max_depth=5)
lrc = LogisticRegression(solver='liblinear' , penalty='l1')
rfc = RandomForestClassifier(n_estimators=50, random_state=2)
abc = AdaBoostClassifier(n_estimators=50, random_state=2)
etc = ExtraTreesClassifier(n_estimators=50, random_state=2)
gbdt = GradientBoostingClassifier(n_estimators=50,random_state=2)
xgb = XGBClassifier(n_estimators=50,random_state=2)

In [43]:
clfs = {
    'SVC'       : svc,
    'KN'        : knc,
    'NB'        : mnb,
    'DT'        : dtc,
    'LR'        : lrc,
    'RF'        : rfc,
    'AdaBoost'  : abc,
    'ETC'       : etc,
    'GBDT'      : gbdt,
    'xgb'       : xgb,
}

This function trains and tests our different ML models

In [44]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def train_classifier(clf, X_train, y_train, X_test, y_test):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted', zero_division=1)
    recall = recall_score(y_test, y_pred, average='weighted', zero_division=1)
    f1 = f1_score(y_test, y_pred, average='weighted', zero_division=1)

    return accuracy, precision, recall, f1
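
Because a single 80/20 split leaves only a small test set, scores can swing a lot from one split to another. A hedged alternative, sketched on synthetic data from `make_classification` and with a default `LogisticRegression` as a stand-in model, is to average the metrics over stratified folds:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic data for illustration only.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

clf = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# One fit per fold; each metric is reported per fold under 'test_<name>'
scores = cross_validate(clf, X, y, cv=cv, scoring=['accuracy', 'f1_weighted'])
mean_accuracy = scores['test_accuracy'].mean()
mean_f1 = scores['test_f1_weighted'].mean()
```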


Calling the 'train_classifier' function and storing the performance of the different models in lists

In [45]:
accuracy_score_list = []
precision_score_list = []
recall_score_list = []
f1_score_list = []

for name,clf in clfs.items():
    current_accuracy,current_precision,current_recall,current_f1_score = train_classifier(clf,X_train,y_train,X_test,y_test)

    print("for ",name)
    print('Accuracy - ',current_accuracy)
    print('Precision - ',current_precision)
    print('recall - ',current_recall)
    print('f1_Score',current_f1_score)

    accuracy_score_list.append(current_accuracy)
    precision_score_list.append(current_precision)
    recall_score_list.append(current_recall)
    f1_score_list.append(current_f1_score)

for  SVC
Accuracy -  0.5892857142857143
Precision -  0.7579719387755102
recall -  0.5892857142857143
f1_Score 0.43699839486356346
for  KN
Accuracy -  0.5178571428571429
Precision -  0.5661764705882353
recall -  0.5178571428571429
f1_Score 0.5104591836734693
for  NB
Accuracy -  0.5714285714285714
Precision -  0.6041581632653061
recall -  0.5714285714285714
f1_Score 0.5844212466988888
for  DT
Accuracy -  0.375
Precision -  0.41316964285714286
recall -  0.375
f1_Score 0.3457867017774852
for  LR
Accuracy -  0.6071428571428571
Precision -  0.6071428571428571
recall -  0.6071428571428571
f1_Score 0.5970127181307305
for  RF
Accuracy -  0.375
Precision -  0.38445378151260506
recall -  0.375
f1_Score 0.3707356076759062

for  AdaBoost
Accuracy -  0.35714285714285715
Precision -  0.38257890365448505
recall -  0.35714285714285715
f1_Score 0.3122710622710623
for  ETC
Accuracy -  0.5535714285714286
Precision -  0.5495834180044706
recall -  0.5535714285714286
f1_Score 0.5405550024888004
for  GBDT
Accuracy -  0.5892857142857143
Precision -  0.5830399727458551
recall -  0.5892857142857143
f1_Score 0.5820905285190999
for  xgb
Accuracy -  0.5535714285714286
Precision -  0.5547619047619048
recall -  0.5535714285714286
f1_Score 0.5441437444543035


Let's convert the results into a DataFrame to get a better look at our base model performances

In [46]:
performance_df = pd.DataFrame({
    'Algorithm': clfs.keys(),
    'Accuracy': accuracy_score_list,
    'Precision': precision_score_list,
    'Recall': recall_score_list,
    'f1_score': f1_score_list,
}).sort_values('Accuracy', ascending=False)

In [47]:
performance_df

Unnamed: 0,Algorithm,Accuracy,Precision,Recall,f1_score
4,LR,0.607143,0.607143,0.607143,0.597013
0,SVC,0.589286,0.757972,0.589286,0.436998
8,GBDT,0.589286,0.58304,0.589286,0.582091
2,NB,0.571429,0.604158,0.571429,0.584421
7,ETC,0.553571,0.549583,0.553571,0.540555
9,xgb,0.553571,0.554762,0.553571,0.544144
1,KN,0.517857,0.566176,0.517857,0.510459
3,DT,0.375,0.41317,0.375,0.345787
5,RF,0.375,0.384454,0.375,0.370736
6,AdaBoost,0.357143,0.382579,0.357143,0.312271
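
The SVC row deserves a closer look: its accuracy (0.589) combined with a much higher precision (0.758) and a much lower F1 (0.437) is consistent with a model that predicts one class for every test row. The sketch below assumes 56 test rows of which 33 belong to one class (33/56 is about 0.589; which class is the majority is an assumption) and shows how `zero_division=1` inflates the weighted precision of such a constant predictor.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score

# Hypothetical test labels: 33 rows of one class, 23 of the other.
y_test = np.array([2] * 33 + [0] * 23)
y_pred = np.full_like(y_test, 2)  # constant prediction of the majority class

accuracy = accuracy_score(y_test, y_pred)
# The never-predicted class has undefined precision; zero_division=1 counts it as 1.0
precision_inflated = precision_score(y_test, y_pred, average='weighted', zero_division=1)
# With zero_division=0 the same class contributes 0.0 instead
precision_strict = precision_score(y_test, y_pred, average='weighted', zero_division=0)
f1 = f1_score(y_test, y_pred, average='weighted', zero_division=1)
```

Under these assumptions the accuracy, inflated precision and F1 line up with the SVC row, so the majority-class rate of about 0.589 is a useful baseline when reading the other rows.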


## No Notable Improvement After Using a Voting Classifier

In [48]:
# Voting Classifier
svc = SVC(kernel='sigmoid', gamma=1.0, probability=True)
mnb = MultinomialNB()
gbdt = GradientBoostingClassifier(n_estimators=50,random_state=2)

from sklearn.ensemble import VotingClassifier

In [49]:
voting = VotingClassifier(estimators=[('svc', svc), ('nb', mnb), ('gbdt', gbdt)],voting='soft')

In [50]:
voting.fit(X_train,y_train)

In [51]:
y_pred = voting.predict(X_test)
print('Accuracy',accuracy_score(y_test, y_pred))
print('Precision',precision_score(y_test, y_pred, average = 'weighted', zero_division=1))
print('Recall',recall_score(y_test, y_pred, average='weighted', zero_division=1))
print('F1_score',f1_score(y_test, y_pred, average='weighted', zero_division=1))

Accuracy 0.625
Precision 0.6403456221198157
Recall 0.625
F1_score 0.6224591565349543
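
To make clear what `voting='soft'` does, here is a minimal sketch on synthetic data. `LogisticRegression` and `GaussianNB` are stand-ins chosen for this illustration (the ensemble above uses SVC, MultinomialNB and GBDT). It averages the members' class probabilities by hand and picks the class with the highest mean, which is the rule soft voting applies when no weights are given.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Synthetic data for illustration only.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

estimators = [
    ('lr', LogisticRegression(max_iter=1000)),
    ('nb', GaussianNB()),
    ('gbdt', GradientBoostingClassifier(n_estimators=50, random_state=2)),
]
voting = VotingClassifier(estimators=estimators, voting='soft').fit(X, y)

# Average the per-class probabilities of the fitted members, then take the argmax
avg_proba = np.mean([est.predict_proba(X) for est in voting.estimators_], axis=0)
manual_pred = voting.classes_[np.argmax(avg_proba, axis=1)]
```

Since every member needs `predict_proba`, the SVC in the cell above has to be created with `probability=True`.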


Hmm, the results are not looking good.
- Now we will select the top 3 models (based on accuracy_score) and apply hyperparameter tuning to improve model performance
- We will use GridSearch for hyperparameter selection

## Note - Could not continue as hyperparameter tuning was taking a lot of time
So, this is the best result I was able to achieve for now.
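
If an exhaustive grid is too slow, one cheaper option might be `RandomizedSearchCV`, which only tries a fixed number of sampled settings. This is a rough sketch on synthetic data; the model, the `C` range, `n_iter=10` and `cv=3` are all assumptions picked to keep the run short, not tuned choices for this dataset.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data for illustration only.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

search = RandomizedSearchCV(
    LogisticRegression(solver='liblinear'),
    param_distributions={'C': loguniform(1e-3, 1e2)},  # sample C on a log scale
    n_iter=10,        # 10 sampled settings instead of a full grid
    cv=3,             # 3 folds per setting, so 30 fits in total
    scoring='accuracy',
    random_state=42,
)
search.fit(X, y)
best_params = search.best_params_
best_score = search.best_score_
```

The total cost is roughly `n_iter` times `cv` fits, so both can be lowered further when time is tight.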