## Purpose
This notebook trains a logistic regression model to predict whether the DJIA rose or fell in a given month, then prints the model's accuracy score, confusion matrix, and classification report.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [2]:
file_path = "all_price_data_withdatetime.csv"
price_df = pd.read_csv(file_path)
price_df

Unnamed: 0,date_time,Beef $/LB,Beef_Pct_Change,Wheat_Price,CPI_Price,Milk Cost per Gallon,DJIA_change
0,1995-07-01,1.365,0.024006,1.147,138.200,2.477,1
1,1995-08-01,1.328,-0.027106,1.161,138.800,2.482,0
2,1995-09-01,1.376,0.036145,1.159,139.500,2.459,1
3,1995-10-01,1.371,-0.003634,1.175,140.600,2.473,0
4,1995-11-01,1.368,-0.002188,1.169,141.000,2.493,1
...,...,...,...,...,...,...,...
326,2022-09-01,4.862,-0.015191,2.362,318.374,4.181,0
327,2022-10-01,4.836,-0.005348,2.386,319.917,4.184,1
328,2022-11-01,4.853,0.003515,2.419,320.034,4.218,1
329,2022-12-01,4.800,-0.010921,2.419,322.507,4.211,0


In [3]:
# Make datetime the index
price_df = price_df.set_index("date_time")
price_df.head()

date_time,Beef $/LB,Beef_Pct_Change,Wheat_Price,CPI_Price,Milk Cost per Gallon,DJIA_change
1995-07-01,1.365,0.024006,1.147,138.2,2.477,1
1995-08-01,1.328,-0.027106,1.161,138.8,2.482,0
1995-09-01,1.376,0.036145,1.159,139.5,2.459,1
1995-10-01,1.371,-0.003634,1.175,140.6,2.473,0
1995-11-01,1.368,-0.002188,1.169,141.0,2.493,1


In [5]:
# Output this dataframe as CSV for future use
# price_df.to_csv("ML1_data.csv")

In [16]:
# Check out proportion of up months to down months
down_months = len(price_df.loc[price_df["DJIA_change"] == 0])
all_months = len(price_df)
up_months = all_months - down_months
print(down_months)
print(all_months)
print("----------")
prp_dwn = down_months / all_months
prp_up = up_months / all_months
print(f'The proportion of down months is {prp_dwn:.2%}.')
print(f'The proportion of up months is {prp_up:.2%}.')

125
331
----------
The proportion of down months is 37.76%.
The proportion of up months is 62.24%.


In [None]:
# Slightly more than one third of all months in the dataset declined.
# That means slightly less than two thirds saw an increase.
# We may want to stratify the data in train_test_split.
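
The class balance can also be read off in a single call. The following is a minimal sketch, assuming a stand-in Series rebuilt from the counts printed above (125 down months, 206 up months) rather than the real `price_df`; the name `djia_change` is hypothetical.

```python
import pandas as pd

# Stand-in for price_df["DJIA_change"], rebuilt from the counts printed above
# (125 down months coded 0, 206 up months coded 1). Hypothetical name.
djia_change = pd.Series([0] * 125 + [1] * 206, name="DJIA_change")

# normalize=True returns fractions of the total instead of raw counts
proportions = djia_change.value_counts(normalize=True)

print(f"Down months: {proportions.loc[0]:.2%}")
print(f"Up months:   {proportions.loc[1]:.2%}")
```

On the real data, the same call would be applied to `price_df["DJIA_change"]` directly.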

In [4]:
# Separate features from target

# The target is whether the DJIA went up or down
y = price_df["DJIA_change"]

# Features are all other data
X = price_df.drop(columns="DJIA_change")

In [9]:
# Split into training and testing sets
# First try without stratifying data
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [10]:
# Create logistic regression model
classifier = LogisticRegression(solver='lbfgs', max_iter=200)

classifier.fit(X_train, y_train)

LogisticRegression(max_iter=200)

In [11]:
# Make predictions
y_pred = classifier.predict(X_test)
results_df = pd.DataFrame({"Prediction": y_pred, "Actual": y_test}).reset_index(drop=True)
results_df.head(20)

Unnamed: 0,Prediction,Actual
0,1,1
1,1,1
2,1,1
3,0,0
4,1,1
5,1,1
6,1,0
7,1,1
8,1,1
9,1,1


In [12]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

0.6506024096385542


In [13]:
from sklearn.metrics import confusion_matrix, classification_report

In [14]:
matrix = confusion_matrix(y_test, y_pred)
print(matrix)

[[ 4 22]
 [ 7 50]]


In [15]:
report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

           0       0.36      0.15      0.22        26
           1       0.69      0.88      0.78        57

    accuracy                           0.65        83
   macro avg       0.53      0.52      0.50        83
weighted avg       0.59      0.65      0.60        83



## Notes:
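As noted above, about 62% of all months in the dataset were up months. A model that predicted an increase every month would therefore have been right about 62% of the time over the full dataset. With an accuracy score of about 65%, this model performs only slightly better than that baseline. In this particular test split, however, 57 of the 83 months (about 69%) were up months, so always predicting an increase would have beaten the model on the test set. The model also identified only 4 of the 26 down months (recall of 0.15).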
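
The comparison against an always-up baseline can be checked directly from the confusion matrix printed above. This is a small sketch using only the four numbers in that matrix; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix printed above: rows are actual classes, columns are predictions
matrix = np.array([[4, 22],
                   [7, 50]])

total = matrix.sum()                        # 83 test months
model_accuracy = np.trace(matrix) / total   # correct predictions sit on the diagonal

# Always predicting the most common actual class gives the majority baseline
majority_baseline = matrix.sum(axis=1).max() / total

print(f"Model accuracy:    {model_accuracy:.3f}")     # 0.651
print(f"Majority baseline: {majority_baseline:.3f}")  # 0.687
```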

### Suggestions to try to improve the model

- increase maximum iterations
- stratify the testing and training data
- include percent change data for milk, wheat, and food CPI prices
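
To illustrate the stratification suggestion, here is a hedged sketch on synthetic stand-in data with the same class counts as the dataset (125 down, 206 up). The feature values are random placeholders, and `random_state=42` is an added assumption to make the split repeatable; the notebook itself does not set it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with the same class counts as the notebook;
# the feature values are random placeholders, not real prices.
rng = np.random.default_rng(0)
y_demo = np.array([0] * 125 + [1] * 206)
X_demo = rng.normal(size=(len(y_demo), 3))

# stratify=y_demo keeps the up/down ratio roughly equal in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)

print(f"Overall up share: {y_demo.mean():.3f}")
print(f"Train up share:   {y_tr.mean():.3f}")
print(f"Test up share:    {y_te.mean():.3f}")
```

Without `stratify`, the share of up months in the test set can drift away from the overall share, which makes the accuracy harder to compare against the always-up baseline.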

## Attempt 2:
This attempt will add percent change data and stratify the training and testing sets.

In [4]:
price_df

date_time,Beef $/LB,Beef_Pct_Change,Wheat_Price,CPI_Price,Milk Cost per Gallon,DJIA_change
1995-07-01,1.365,0.024006,1.147,138.200,2.477,1
1995-08-01,1.328,-0.027106,1.161,138.800,2.482,0
1995-09-01,1.376,0.036145,1.159,139.500,2.459,1
1995-10-01,1.371,-0.003634,1.175,140.600,2.473,0
1995-11-01,1.368,-0.002188,1.169,141.000,2.493,1
...,...,...,...,...,...,...
2022-09-01,4.862,-0.015191,2.362,318.374,4.181,0
2022-10-01,4.836,-0.005348,2.386,319.917,4.184,1
2022-11-01,4.853,0.003515,2.419,320.034,4.218,1
2022-12-01,4.800,-0.010921,2.419,322.507,4.211,0


In [5]:
price_df["Wheat_Pct_Change"] = price_df["Wheat_Price"].pct_change()
price_df.head()

date_time,Beef $/LB,Beef_Pct_Change,Wheat_Price,CPI_Price,Milk Cost per Gallon,DJIA_change,Wheat_Pct_Change
1995-07-01,1.365,0.024006,1.147,138.2,2.477,1,
1995-08-01,1.328,-0.027106,1.161,138.8,2.482,0,0.012206
1995-09-01,1.376,0.036145,1.159,139.5,2.459,1,-0.001723
1995-10-01,1.371,-0.003634,1.175,140.6,2.473,0,0.013805
1995-11-01,1.368,-0.002188,1.169,141.0,2.493,1,-0.005106


In [6]:
price_df["Milk_Pct_Change"] = price_df["Milk Cost per Gallon"].pct_change()
price_df.head()

date_time,Beef $/LB,Beef_Pct_Change,Wheat_Price,CPI_Price,Milk Cost per Gallon,DJIA_change,Wheat_Pct_Change,Milk_Pct_Change
1995-07-01,1.365,0.024006,1.147,138.2,2.477,1,,
1995-08-01,1.328,-0.027106,1.161,138.8,2.482,0,0.012206,0.002019
1995-09-01,1.376,0.036145,1.159,139.5,2.459,1,-0.001723,-0.009267
1995-10-01,1.371,-0.003634,1.175,140.6,2.473,0,0.013805,0.005693
1995-11-01,1.368,-0.002188,1.169,141.0,2.493,1,-0.005106,0.008087


In [7]:
price_df["CPI_Pct_Change"] = price_df["CPI_Price"].pct_change()
price_df.head()

date_time,Beef $/LB,Beef_Pct_Change,Wheat_Price,CPI_Price,Milk Cost per Gallon,DJIA_change,Wheat_Pct_Change,Milk_Pct_Change,CPI_Pct_Change
1995-07-01,1.365,0.024006,1.147,138.2,2.477,1,,,
1995-08-01,1.328,-0.027106,1.161,138.8,2.482,0,0.012206,0.002019,0.004342
1995-09-01,1.376,0.036145,1.159,139.5,2.459,1,-0.001723,-0.009267,0.005043
1995-10-01,1.371,-0.003634,1.175,140.6,2.473,0,0.013805,0.005693,0.007885
1995-11-01,1.368,-0.002188,1.169,141.0,2.493,1,-0.005106,0.008087,0.002845


In [12]:
# Remove rows with null values
price_nona_df = price_df.dropna()
price_nona_df

date_time,Beef $/LB,Beef_Pct_Change,Wheat_Price,CPI_Price,Milk Cost per Gallon,DJIA_change,Wheat_Pct_Change,Milk_Pct_Change,CPI_Pct_Change
1995-08-01,1.328,-0.027106,1.161,138.800,2.482,0,0.012206,0.002019,0.004342
1995-09-01,1.376,0.036145,1.159,139.500,2.459,1,-0.001723,-0.009267,0.005043
1995-10-01,1.371,-0.003634,1.175,140.600,2.473,0,0.013805,0.005693,0.007885
1995-11-01,1.368,-0.002188,1.169,141.000,2.493,1,-0.005106,0.008087,0.002845
1995-12-01,1.403,0.025585,1.154,141.500,2.518,1,-0.012831,0.010028,0.003546
...,...,...,...,...,...,...,...,...,...
2022-09-01,4.862,-0.015191,2.362,318.374,4.181,0,0.027850,-0.003100,0.002964
2022-10-01,4.836,-0.005348,2.386,319.917,4.184,1,0.010161,0.000718,0.004847
2022-11-01,4.853,0.003515,2.419,320.034,4.218,1,0.013831,0.008126,0.000366
2022-12-01,4.800,-0.010921,2.419,322.507,4.211,0,0.000000,-0.001660,0.007727
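
The percent change columns and the dropped first row can be reproduced on a tiny example. This sketch uses the first three wheat prices from the tables above; the Series name `wheat` is illustrative.

```python
import pandas as pd

# First three wheat prices from the tables above
wheat = pd.Series(
    [1.147, 1.161, 1.159],
    index=["1995-07-01", "1995-08-01", "1995-09-01"],
    name="Wheat_Price",
)

# pct_change() computes (current - previous) / previous for each row
wheat_pct = wheat.pct_change()

# The first row has no previous value, so it is NaN and dropna() removes it
wheat_pct_clean = wheat_pct.dropna()

print(wheat_pct)
print(len(wheat), "rows before dropna,", len(wheat_pct_clean), "rows after")
```

This is why the cleaned dataframe starts at 1995-08-01: the 1995-07-01 row has no prior month to compare against in any of the new columns.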


In [22]:
# Separate features from target

# The target is whether the DJIA went up or down
y = price_nona_df["DJIA_change"]

# Features are all other data
X = price_nona_df.drop(columns="DJIA_change")

In [23]:
# Split into training and testing sets
# Stratify data
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

In [24]:
# Create logistic regression model
classifier = LogisticRegression(solver='lbfgs', max_iter=200)

classifier.fit(X_train, y_train)

LogisticRegression(max_iter=200)

In [25]:
# Make predictions
y_pred = classifier.predict(X_test)
results_df = pd.DataFrame({"Prediction": y_pred, "Actual": y_test}).reset_index(drop=True)
results_df.head(20)

Unnamed: 0,Prediction,Actual
0,1,0
1,1,1
2,1,1
3,1,0
4,1,0
5,1,1
6,1,1
7,1,1
8,1,1
9,1,1


In [26]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

0.6626506024096386


In [27]:
from sklearn.metrics import confusion_matrix, classification_report

In [28]:
matrix = confusion_matrix(y_test, y_pred)
print(matrix)

[[ 4 27]
 [ 1 51]]


In [29]:
report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

           0       0.80      0.13      0.22        31
           1       0.65      0.98      0.78        52

    accuracy                           0.66        83
   macro avg       0.73      0.55      0.50        83
weighted avg       0.71      0.66      0.57        83



### Notes:
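These changes produced a slight improvement. With the new percent change data and stratification of the training and testing sets, the same model received an accuracy score of about 66.3%, compared to the previous score of 65.1%. On an 83-month test set, that difference amounts to one additional correct prediction (55 versus 54), and the two attempts used different random splits, so the gain may not be meaningful.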
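
Accuracy alone hides how each class fared, so it can help to put the two confusion matrices side by side. The sketch below recomputes accuracy and per-class recall from the matrices printed in the first attempt and in Attempt 2; the helper `summarize` is a hypothetical name introduced for illustration.

```python
import numpy as np

# Confusion matrices printed in the first attempt and in Attempt 2
# (rows are actual classes 0 and 1, columns are predicted classes)
attempt_1 = np.array([[4, 22], [7, 50]])
attempt_2 = np.array([[4, 27], [1, 51]])

def summarize(matrix):
    """Return accuracy and per-class recall for a confusion matrix."""
    accuracy = np.trace(matrix) / matrix.sum()
    recall = np.diag(matrix) / matrix.sum(axis=1)
    return accuracy, recall

for name, matrix in [("Attempt 1", attempt_1), ("Attempt 2", attempt_2)]:
    accuracy, recall = summarize(matrix)
    print(f"{name}: accuracy={accuracy:.3f}, "
          f"down recall={recall[0]:.2f}, up recall={recall[1]:.2f}")
```

In both attempts the model catches only 4 down months, so the accuracy gain comes from predicting "up" more often rather than from recognizing declines better.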

## Attempt 3:
Run the same model with the same data as in Attempt 2, but with more iterations.

In [33]:
# Create logistic regression model
# Double maximum iterations to 400
classifier = LogisticRegression(solver='lbfgs', max_iter=400)

classifier.fit(X_train, y_train)

LogisticRegression(max_iter=400)

In [34]:
# Make predictions
y_pred = classifier.predict(X_test)
results_df = pd.DataFrame({"Prediction": y_pred, "Actual": y_test}).reset_index(drop=True)
results_df.head(20)

Unnamed: 0,Prediction,Actual
0,1,0
1,1,1
2,1,1
3,1,0
4,1,0
5,1,1
6,1,1
7,1,1
8,1,1
9,1,1


In [35]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

0.6626506024096386


In [36]:
from sklearn.metrics import confusion_matrix, classification_report

In [37]:
matrix = confusion_matrix(y_test, y_pred)
print(matrix)

[[ 4 27]
 [ 1 51]]


In [38]:
report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

           0       0.80      0.13      0.22        31
           1       0.65      0.98      0.78        52

    accuracy                           0.66        83
   macro avg       0.73      0.55      0.50        83
weighted avg       0.71      0.66      0.57        83



### Notes:
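This made no difference: the accuracy, confusion matrix, and classification report are identical to those in Attempt 2, which suggests the solver had already converged within 200 iterations.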
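
One way to check whether `max_iter` is the limiting factor is to inspect the fitted model's `n_iter_` attribute, which records how many iterations the solver actually ran. The sketch below uses synthetic placeholder data from `make_classification`, not the notebook's price data, so the iteration counts it prints say nothing about the real model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic placeholder data; not the notebook's price data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, random_state=0
)

results = {}
for max_iter in (200, 400):
    model = LogisticRegression(solver="lbfgs", max_iter=max_iter)
    model.fit(X_demo, y_demo)
    # n_iter_ holds the number of iterations the solver actually ran
    results[max_iter] = (int(model.n_iter_[0]), model.coef_.copy())
    print(f"max_iter={max_iter}: solver stopped after {results[max_iter][0]} iterations")

# If the solver converged before the lower cap, both fits are identical
same_fit = np.allclose(results[200][1], results[400][1])
print("Identical coefficients:", same_fit)
```

If `n_iter_` is below the cap, raising `max_iter` further cannot change the fit; if it equals the cap, scikit-learn also emits a `ConvergenceWarning`.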

## Attempt 4:
Decrease iterations from the original 200.

In [42]:
# Create logistic regression model
# Decrease max iterations to 150
classifier = LogisticRegression(solver='lbfgs', max_iter=150)

classifier.fit(X_train, y_train)

LogisticRegression(max_iter=150)

In [43]:
# Make predictions
y_pred = classifier.predict(X_test)
results_df = pd.DataFrame({"Prediction": y_pred, "Actual": y_test}).reset_index(drop=True)
results_df.head(20)

Unnamed: 0,Prediction,Actual
0,1,0
1,1,1
2,1,1
3,1,0
4,1,0
5,1,1
6,1,1
7,1,1
8,1,1
9,1,1


In [44]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

0.6626506024096386


In [45]:
from sklearn.metrics import confusion_matrix, classification_report

In [46]:
matrix = confusion_matrix(y_test, y_pred)
print(matrix)

[[ 4 27]
 [ 1 51]]


In [47]:
report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

           0       0.80      0.13      0.22        31
           1       0.65      0.98      0.78        52

    accuracy                           0.66        83
   macro avg       0.73      0.55      0.50        83
weighted avg       0.71      0.66      0.57        83



### Notes:
No difference: the results are identical to those of the previous two attempts.