## Purpose
This notebook applies ML model 1 (originally built for the data in the `same_date` folder) to a new data set that pairs each month's food price data with the DJIA's movement in the following month. As noted in the Jupyter Notebook in the `same_date` folder where the model was built, it performed moderately well there, so as a first step it is applied here without any adjustments.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [2]:
file_path = "ML2_pred_data.csv"
price_df = pd.read_csv(file_path)
price_df

,date_time,Beef $/LB,Beef_Pct_Change (Monthly),Price,Food - Index (1982-1984=100),Food - Pct_Change (Yearly),Food - Pct_Change (Monthly),All Items - Index (1982-1984=100),All Items - Pct_Change (Yearly),All Items - Pct_Change (Monthly),Milk Cost per Gallon,Next_Month_DJIA_Change
0,1995-07-01,1.365,2.400600,1.147,138.200,0.582242,0.290276,152.600,2.830189,0.131234,2.477,0
1,1995-08-01,1.328,-2.710623,1.161,138.800,1.166181,0.434153,152.900,2.617450,0.196592,2.482,1
2,1995-09-01,1.376,3.614458,1.159,139.500,1.528384,0.504323,153.100,2.545211,0.130804,2.459,0
3,1995-10-01,1.371,-0.363372,1.175,140.600,2.852963,0.788530,153.500,2.744311,0.261267,2.473,1
4,1995-11-01,1.368,-0.218818,1.169,141.000,3.296703,0.284495,153.700,2.603471,0.130293,2.493,1
...,...,...,...,...,...,...,...,...,...,...,...,...
325,2022-08-01,4.937,0.899244,2.298,317.433,10.640456,0.518054,295.320,8.227361,0.234872,4.194,0
326,2022-09-01,4.862,-1.519141,2.362,318.374,8.964276,0.296441,296.539,8.214854,0.412773,4.181,1
327,2022-10-01,4.836,-0.534759,2.386,319.917,8.002714,0.484650,297.987,7.762493,0.488300,4.184,1
328,2022-11-01,4.853,0.351530,2.419,320.034,6.721044,0.036572,298.598,7.135348,0.205043,4.218,0


In [3]:
# Make the date_time column the index
price_df = price_df.set_index("date_time")
price_df.head()

date_time,Beef $/LB,Beef_Pct_Change (Monthly),Price,Food - Index (1982-1984=100),Food - Pct_Change (Yearly),Food - Pct_Change (Monthly),All Items - Index (1982-1984=100),All Items - Pct_Change (Yearly),All Items - Pct_Change (Monthly),Milk Cost per Gallon,Next_Month_DJIA_Change
1995-07-01,1.365,2.4006,1.147,138.2,0.582242,0.290276,152.6,2.830189,0.131234,2.477,0
1995-08-01,1.328,-2.710623,1.161,138.8,1.166181,0.434153,152.9,2.61745,0.196592,2.482,1
1995-09-01,1.376,3.614458,1.159,139.5,1.528384,0.504323,153.1,2.545211,0.130804,2.459,0
1995-10-01,1.371,-0.363372,1.175,140.6,2.852963,0.78853,153.5,2.744311,0.261267,2.473,1
1995-11-01,1.368,-0.218818,1.169,141.0,3.296703,0.284495,153.7,2.603471,0.130293,2.493,1


In [4]:
# Separate features from target

# The target is whether the DJIA went up or down
y = price_df["Next_Month_DJIA_Change"]

# Features are all other data
X = price_df.drop(columns="Next_Month_DJIA_Change")

In [5]:
# Split into training and testing sets
# First try without stratifying data
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [6]:
# Create logistic regression model
classifier = LogisticRegression(solver='lbfgs', max_iter=200)

classifier.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(max_iter=200)
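
The convergence warning above points to two remedies: raising `max_iter` or scaling the data. A minimal sketch of the scaling route follows. It runs on synthetic stand-in data rather than `ML2_pred_data.csv`. The `*_demo` names, the random labels, and the choice of `StandardScaler` inside `make_pipeline` are illustrative assumptions, not part of the original model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's features: columns on very different
# scales (a price near 1-5, an index near 140-320, a small percent change).
rng = np.random.default_rng(0)
n_rows = 329
X_demo = np.column_stack([
    rng.uniform(1.0, 5.0, n_rows),
    rng.uniform(140.0, 320.0, n_rows),
    rng.normal(0.0, 1.5, n_rows),
])
y_demo = rng.integers(0, 2, n_rows)  # random labels, for illustration only

X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, random_state=0
)

# Inside a pipeline, the scaler is fit on the training split only and then
# reused to transform whatever is passed to predict().
scaled_classifier = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="lbfgs", max_iter=200),
)
scaled_classifier.fit(X_train_demo, y_train_demo)

# n_iter_ reports how many lbfgs iterations were actually used.
iterations_used = scaled_classifier.named_steps["logisticregression"].n_iter_
print(iterations_used)
```

If `iterations_used` stays well below `max_iter`, the solver converged and the warning should not appear. Whether scaling also improves accuracy on the real data would need to be checked separately.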

In [7]:
# Make predictions
y_pred = classifier.predict(X_test)
results_df = pd.DataFrame({"Prediction": y_pred, "Actual": y_test}).reset_index(drop=True)
results_df.head(20)

,Prediction,Actual
0,1,1
1,1,0
2,1,1
3,1,1
4,1,1
5,0,1
6,0,1
7,1,1
8,1,0
9,1,1


In [8]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

0.5662650602409639


In [9]:
from sklearn.metrics import confusion_matrix, classification_report

matrix = confusion_matrix(y_test, y_pred)
print(matrix)

[[ 3 30]
 [ 6 44]]


In [10]:
report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

           0       0.33      0.09      0.14        33
           1       0.59      0.88      0.71        50

    accuracy                           0.57        83
   macro avg       0.46      0.49      0.43        83
weighted avg       0.49      0.57      0.48        83
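
The figures in the report can be traced back to the confusion matrix from cell 9. The sketch below recomputes them by hand in plain Python, assuming scikit-learn's convention that rows are actual classes and columns are predicted classes. The `majority_baseline` value is an extra comparison point that does not appear in the original output.

```python
# Confusion matrix printed in cell 9 (rows: actual 0/1, columns: predicted 0/1).
matrix = [[3, 30],
          [6, 44]]

(tn, fp), (fn, tp) = matrix
total = tn + fp + fn + tp

accuracy = (tn + tp) / total       # 47 / 83
precision_0 = tn / (tn + fn)       # 3 / 9: correct among predicted 0
recall_0 = tn / (tn + fp)          # 3 / 33: found among actual 0
precision_1 = tp / (tp + fp)       # 44 / 74: correct among predicted 1
recall_1 = tp / (tp + fn)          # 44 / 50: found among actual 1

# Accuracy of always predicting the more common class in this test split.
majority_baseline = max(tn + fp, fn + tp) / total   # 50 / 83

for name, value in [
    ("accuracy", accuracy),
    ("precision (0)", precision_0),
    ("recall (0)", recall_0),
    ("precision (1)", precision_1),
    ("recall (1)", recall_1),
    ("majority baseline", majority_baseline),
]:
    print(f"{name:<18}{value:.3f}")
```

The accuracy comes out at about 0.566, while always predicting `1` would score about 0.602 on this particular split.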



## Notes:
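This model performed worse at predicting DJIA movement one month ahead than it did at predicting DJIA movement for the same month. Here, it received an accuracy score of about 57%, compared to the 65% it received on the `same_date` data. Notably, precision went up in the negative (`0`) category, although recall for that category was only 0.09: the model predicted `1` for 74 of the 83 test rows. Because 50 of those 83 rows were actually `1`, always predicting `1` would have scored about 60% on this split, so the model did not beat that baseline.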

### Suggestions to try to improve the model

- increase maximum iterations
- stratify the testing and training data
- include percent change data for milk, wheat, and food CPI prices
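
Two of these suggestions can be sketched on a small made-up frame: deriving a monthly percent-change column and stratifying the split on the target. The values, the `*_demo` names, and the new column name `Milk_Pct_Change (Monthly)` are hypothetical. Only `Milk Cost per Gallon` and `Next_Month_DJIA_Change` are taken from the data set's existing columns.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical miniature frame; the numbers are invented for illustration.
demo_df = pd.DataFrame({
    "Milk Cost per Gallon": [2.50, 2.55, 2.52, 2.60, 2.58, 2.65, 2.70,
                             2.68, 2.75, 2.80, 2.78, 2.85, 2.90],
    "Next_Month_DJIA_Change": [1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1],
})

# Month-over-month percent change. The first row has no previous month,
# so its value is NaN and the row is dropped.
demo_df["Milk_Pct_Change (Monthly)"] = (
    demo_df["Milk Cost per Gallon"].pct_change() * 100
)
demo_df = demo_df.dropna()

X_demo = demo_df.drop(columns="Next_Month_DJIA_Change")
y_demo = demo_df["Next_Month_DJIA_Change"]

# stratify=y_demo keeps the share of 0s and 1s the same in both splits;
# random_state makes the split repeatable between runs.
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

print(y_train_demo.value_counts(normalize=True))
print(y_test_demo.value_counts(normalize=True))
```

Fixing `random_state` also makes later runs comparable, since the unseeded split in cell 5 returns a different test set each time it runs.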