# Logistic Regression
This program runs logistic regression to predict whether a stock is a good option to buy. A stock is classified as 'Buy' if it beats the S&P 500 and its ROI is above 2%.
Because avoiding losses matters more in investing than achieving profits, I focused on maximizing the precision of the model (minimizing False Positives relative to True Positives).
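
As a minimal sketch of the labeling rule described above, the snippet below builds a 'Buy' flag from two return columns. The column names `stock_roi` and `sp500_roi`, the sample values, and the convention that returns are stored as fractions are all assumptions made for illustration; the notebook does not show how the `Buy` column in `stocks_data.csv` was actually produced.

```python
import pandas as pd

# Hypothetical returns over the holding period, as fractions (0.05 == 5%).
# Column names and values are illustrative, not taken from stocks_data.csv.
returns = pd.DataFrame({
    "stock_roi": [0.05, 0.01, 0.10, -0.03],
    "sp500_roi": [0.03, 0.00, 0.12, -0.05],
})

MIN_ROI = 0.02  # the 2% threshold from the text

# 'Buy' (1) only when the stock beats the index AND clears the ROI threshold.
returns["Buy"] = (
    (returns["stock_roi"] > returns["sp500_roi"])
    & (returns["stock_roi"] > MIN_ROI)
).astype(int)

print(returns)
```

Only the first row satisfies both conditions: the second beats the index but stays under 2%, the third clears 2% but trails the index, and the fourth beats the index with a negative return.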

### 1. Imports

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler

### 2. Load the data

In [105]:
data = pd.read_csv('stocks_data.csv')
data.describe(include='all')

Unnamed: 0.1,Unnamed: 0,Ticker,Year,Month,MA Ratio,Buy,ROE,Insider Ownership Growth,Institutional Ownership Growth,Forecast EPS Growth,Avg 2Q EPS Growth,Avg 2Q EPS Surprise,YoY EPS Growth,Sector Performance,Market Performance
count,14854.0,14854,14854.0,14854.0,14854.0,14854.0,14854.0,14854.0,14854.0,14854.0,14854.0,14854.0,14854.0,14854.0,14854.0
unique,,393,,,,,,,,,,,,,
top,,AWK,,,,,,,,,,,,,
freq,,55,,,,,,,,,,,,,
mean,7426.5,,2020.669113,6.22324,1.004148,0.526727,39.494365,0.015486,0.026708,0.057775,0.181477,13.755183,0.369529,1.488003,1.438443
std,4288.124784,,1.428016,3.520757,0.046473,0.499302,181.839873,0.269863,0.230675,2.136724,2.111809,46.751483,3.637998,8.164589,7.038394
min,0.0,,2018.0,1.0,0.580721,0.0,-613.743387,-0.633527,-0.714136,-0.992366,-45.05,-65.625,-0.961538,-44.900728,-22.795349
25%,3713.25,,2019.0,3.0,0.977766,0.0,10.160854,-0.00135,-0.023114,-0.184264,-0.040838,2.015,0.017606,-3.453784,-3.160007
50%,7426.5,,2021.0,6.0,1.00536,1.0,19.251991,0.0,-0.000648,-0.039062,0.045662,6.055,0.130688,1.496227,2.069271
75%,11139.75,,2022.0,9.0,1.031953,1.0,31.949569,0.008,0.033653,0.086957,0.154182,13.135,0.275148,6.429508,5.50743


### 3. Clean the data
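Let's remove outliers. Rows above the 99.9th percentile of six skewed features are dropped, along with rows below the 0.1th percentile of 'Avg 2Q EPS Growth'. This removes 97 of the 14,854 rows.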

In [106]:
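# Drop rows above the 99.9th percentile of each listed feature (upper-tail outliers).
for column in ['ROE', 'Insider Ownership Growth', 'Institutional Ownership Growth', 'Forecast EPS Growth', 'Avg 2Q EPS Growth', 'YoY EPS Growth']:
    upper_bound = data[column].quantile(0.999)
    data = data[data[column] <= upper_bound]
# 'Avg 2Q EPS Growth' also has an extreme lower tail, so trim below its 0.1th percentile.
lower_bound = data['Avg 2Q EPS Growth'].quantile(0.001)
data = data[data['Avg 2Q EPS Growth'] >= lower_bound]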
data.describe(include='all')

Unnamed: 0.1,Unnamed: 0,Ticker,Year,Month,MA Ratio,Buy,ROE,Insider Ownership Growth,Institutional Ownership Growth,Forecast EPS Growth,Avg 2Q EPS Growth,Avg 2Q EPS Surprise,YoY EPS Growth,Sector Performance,Market Performance
count,14757.0,14757,14757.0,14757.0,14757.0,14757.0,14757.0,14757.0,14757.0,14757.0,14757.0,14757.0,14757.0,14757.0,14757.0
unique,,391,,,,,,,,,,,,,
top,,IQV,,,,,,,,,,,,,
freq,,55,,,,,,,,,,,,,
mean,7431.564546,,2020.665447,6.222877,1.004135,0.526733,35.930619,0.009596,0.021883,0.011317,0.145736,13.522146,0.272787,1.486683,1.441679
std,4285.343004,,1.428264,3.521282,0.046414,0.499302,99.720733,0.072866,0.133643,0.519399,0.806091,45.365794,1.069267,8.159292,7.043774
min,0.0,,2018.0,1.0,0.580721,0.0,-613.743387,-0.633527,-0.714136,-0.992366,-3.489011,-65.625,-0.961538,-44.900728,-22.795349
25%,3726.0,,2019.0,3.0,0.977791,0.0,10.176162,-0.001361,-0.023167,-0.182927,-0.040646,1.995,0.018152,-3.453784,-3.160007
50%,7430.0,,2021.0,6.0,1.005388,1.0,19.293997,0.0,-0.000699,-0.038835,0.045532,6.01,0.130396,1.496227,2.069271
75%,11144.0,,2022.0,9.0,1.031914,1.0,31.949569,0.007968,0.033287,0.086705,0.152364,13.08,0.274611,6.429508,5.50743


### 4. Split the data into train and test sets, standardize the data

In [107]:
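data = data.reset_index(drop=True)
# Chronological split: train on years up to 2022, test on later years.
train_data = data[data['Year'] <= 2022]
test_data = data[data['Year'] > 2022]
# Drop the target, the date and ticker columns, and the leftover index column (data.columns[0]).
non_features = ['Year', 'Buy', 'Month', 'Ticker', data.columns[0]]
x_train = train_data.drop(non_features, axis=1)
y_train = train_data['Buy']
x_test = test_data.drop(non_features, axis=1)
y_test = test_data['Buy']

# Fit the scaler on the training set only, then reuse its statistics on the test set.
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)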
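
The split-then-scale pattern above can be seen on a toy frame. This is a sketch on made-up numbers (the six rows and the single `ROE` feature are illustrative assumptions, not the stock data): the scaler learns its mean and standard deviation from the training years only, so the later test rows are expressed relative to the training distribution and no information from the test period leaks into preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame with the same kind of layout: a Year column, one feature, a label.
toy = pd.DataFrame({
    "Year": [2021, 2021, 2022, 2022, 2023, 2023],
    "ROE": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "Buy": [0, 1, 0, 1, 1, 0],
})

# Chronological split: everything up to 2022 trains, later years test.
train = toy[toy["Year"] <= 2022]
test = toy[toy["Year"] > 2022]

scaler = StandardScaler()
x_train = scaler.fit_transform(train[["ROE"]])  # statistics come from train only
x_test = scaler.transform(test[["ROE"]])        # test reuses the train statistics

print("train mean used for scaling:", scaler.mean_)
print("scaled test values:", x_test.ravel())
```

The scaled training column has mean zero by construction, while the scaled test values need not; here they sit well above the training range because the toy `ROE` values keep growing over time.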

### 5. Train the model
Precision for class 1 is the priority, since in investing avoiding losses is more valuable than gaining profits. Therefore I gave class 0 20% more weight than class 1, so that the model produces as few False Positives relative to True Positives as possible.

In [108]:
# Weight class 0 ('not Buy') 20% more than class 1 to penalize False Positives.
class_weight = {0: 120, 1: 100}
model = LogisticRegression(max_iter=10000, class_weight=class_weight)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
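
To illustrate what the class weights do, here is a small sketch on synthetic data. The dataset from `make_classification` and its parameters are assumptions chosen only to produce overlapping classes, and it is not the stock data. It compares an unweighted logistic regression with one that weights class 0 by 20% more and counts how many samples each labels as class 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, heavily overlapping classes; not the stock data.
X, y = make_classification(
    n_samples=2000, n_features=5, n_informative=3, n_redundant=0,
    class_sep=0.5, random_state=0,
)

plain = LogisticRegression(max_iter=1000).fit(X, y)
# Same 120:100 ratio as in the notebook, written as 1.2:1.0.
weighted = LogisticRegression(max_iter=1000, class_weight={0: 1.2, 1: 1.0}).fit(X, y)

# Count how many samples each model labels as class 1.
n_pos_plain = int(plain.predict(X).sum())
n_pos_weighted = int(weighted.predict(X).sum())
print("predicted positives, unweighted:", n_pos_plain)
print("predicted positives, class 0 weighted 1.2x:", n_pos_weighted)
```

Errors on class 0 cost more in the weighted model, so it should become more reluctant to predict class 1. The sketch uses 1.2 and 1.0 rather than 120 and 100: the ratio is the same, but in scikit-learn the weights multiply the loss term, so their absolute scale also interacts with the regularization strength `C`.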

### 6. Evaluation

In [109]:
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
precision = precision_score(y_test, y_pred)
print(f'Precision: {precision}')

conf_matrix = confusion_matrix(y_test, y_pred)
print('Confusion Matrix:')
print(conf_matrix)

class_report = classification_report(y_test, y_pred)
print('Classification Report:')
print(class_report)

Accuracy: 0.5367510809141446
Precision: 0.5817490494296578
Confusion Matrix:
[[716 110]
 [640 153]]
Classification Report:
              precision    recall  f1-score   support

           0       0.53      0.87      0.66       826
           1       0.58      0.19      0.29       793

    accuracy                           0.54      1619
   macro avg       0.55      0.53      0.47      1619
weighted avg       0.55      0.54      0.48      1619
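
The reported figures can be re-derived by hand from the confusion matrix. The sketch below only redoes that arithmetic on the four printed counts, relying on scikit-learn's layout (rows are true classes, columns are predicted classes); `base_rate` is an extra quantity introduced here for comparison, namely the share of actual 'Buy' cases in the test set.

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predictions.
conf_matrix = np.array([[716, 110],
                        [640, 153]])
tn, fp, fn, tp = conf_matrix.ravel()

precision = tp / (tp + fp)                 # 153 / (153 + 110)
recall = tp / (tp + fn)                    # 153 / (153 + 640)
accuracy = (tp + tn) / conf_matrix.sum()   # (153 + 716) / 1619
base_rate = (tp + fn) / conf_matrix.sum()  # share of true 'Buy' rows: 793 / 1619

print(f"precision={precision:.4f} recall={recall:.4f} "
      f"accuracy={accuracy:.4f} base_rate={base_rate:.4f}")
```

Comparing `precision` with `base_rate` shows how much better a 'Buy' prediction is than labeling every stock in the test period as 'Buy'.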



### 7. Conclusion
The model's performance metrics reveal a deliberate trade-off due to the increased weighting for class 0. By focusing on minimizing False Positives, the model sacrifices some sensitivity to class 1 (as evidenced by the low recall of 0.19), but achieves a higher precision (0.58) for this class.

While the overall model accuracy is modest at 53.7%, the precision for class 1 means that when the model predicts 'Buy', it is correct about 58% of the time, compared with the 49% share of actual 'Buy' cases in the test set (793 of 1,619). This edge could be valuable for an investment strategy. However, the model's recall for class 1 is quite low: it flags only 153 of the 793 actual 'Buy' cases, so many opportunities are missed.

Given these results, there is potential to build a profitable strategy focused on correctly identifying true positives, though the model would benefit from further refinement. To improve robustness and generalization, a larger and more diverse dataset covering a broader range of dates and market conditions is needed. This could help the model better distinguish between the classes and improve overall predictive performance.