## Purpose
This notebook creates a logistic regression model that predicts whether the DJIA rose or fell in a given month, then prints its accuracy score, confusion matrix, and classification report.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [2]:
file_path = "all_price_data_withdatetime.csv"
price_df = pd.read_csv(file_path)
price_df

Unnamed: 0,date_time,Beef $/LB,Beef_Pct_Change,Wheat_Price,CPI_Price,Milk Cost per Gallon,DJIA_change
0,1995-07-01,1.365,0.024006,1.147,138.200,2.477,1
1,1995-08-01,1.328,-0.027106,1.161,138.800,2.482,0
2,1995-09-01,1.376,0.036145,1.159,139.500,2.459,1
3,1995-10-01,1.371,-0.003634,1.175,140.600,2.473,0
4,1995-11-01,1.368,-0.002188,1.169,141.000,2.493,1
...,...,...,...,...,...,...,...
326,2022-09-01,4.862,-0.015191,2.362,318.374,4.181,0
327,2022-10-01,4.836,-0.005348,2.386,319.917,4.184,1
328,2022-11-01,4.853,0.003515,2.419,320.034,4.218,1
329,2022-12-01,4.800,-0.010921,2.419,322.507,4.211,0


In [3]:
# Make datetime the index
price_df = price_df.set_index("date_time")
price_df.head()

date_time,Beef $/LB,Beef_Pct_Change,Wheat_Price,CPI_Price,Milk Cost per Gallon,DJIA_change
1995-07-01,1.365,0.024006,1.147,138.2,2.477,1
1995-08-01,1.328,-0.027106,1.161,138.8,2.482,0
1995-09-01,1.376,0.036145,1.159,139.5,2.459,1
1995-10-01,1.371,-0.003634,1.175,140.6,2.473,0
1995-11-01,1.368,-0.002188,1.169,141.0,2.493,1


In [None]:
# Output this dataframe as CSV for future use
# price_df.to_csv("ML1_data.csv")

In [4]:
# Check out proportion of up months to down months
down_months = len(price_df.loc[price_df["DJIA_change"] == 0])
all_months = len(price_df)
up_months = all_months - down_months
print(down_months)
print(all_months)
print("----------")
prp_dwn = down_months / all_months
prp_up = up_months / all_months
print(f'The proportion of down months is {prp_dwn:.0%}.')
print(f'The proportion of up months is {prp_up:.0%}.')

125
331
----------
The proportion of down months is 38%.
The proportion of up months is 62%.
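
The same proportions can also be read directly from the target column with `value_counts(normalize=True)`. The sketch below uses a small hypothetical series in place of `price_df["DJIA_change"]`; the values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for price_df["DJIA_change"]: 1 = up month, 0 = down month
djia_change = pd.Series([1, 0, 1, 0, 1, 1, 0, 1], name="DJIA_change")

# normalize=True returns fractions that sum to 1 instead of raw counts
proportions = djia_change.value_counts(normalize=True)

print(f"Down months: {proportions.loc[0]:.1%}")  # Down months: 37.5%
print(f"Up months:   {proportions.loc[1]:.1%}")  # Up months:   62.5%
```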


In [5]:
# Slightly more than one third of all months in the dataset declined.
# That means slightly less than two thirds saw an increase.
# We may want to stratify the data in train_test_split

In [6]:
# Separate features from target

# The target is whether the DJIA went up or down
y = price_df["DJIA_change"]

# Features are all other data
X = price_df.drop(columns="DJIA_change")

In [None]:
# Split into training and testing sets
# First try without stratifying data
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [None]:
# Create logistic regression model
classifier = LogisticRegression(solver='lbfgs', max_iter=200)

classifier.fit(X_train, y_train)

In [None]:
# Make predictions
y_pred = classifier.predict(X_test)
results_df = pd.DataFrame({"Prediction": y_pred, "Actual": y_test}).reset_index(drop=True)
results_df.head(20)

In [None]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

In [None]:
matrix = confusion_matrix(y_test, y_pred)
print(matrix)

In [None]:
report = classification_report(y_test, y_pred)
print(report)

## Notes:
As noted above, about 62% of the months in the dataset were up months. Predicting that the DJIA would rise every month would therefore have been right about 62% of the time. With an accuracy score of about 65%, this model performs only slightly better than that baseline.
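
The "always predict up" comparison can be made explicit with scikit-learn's `DummyClassifier`. This is a minimal sketch on hypothetical labels (62 up months and 38 down months, chosen only to mirror the rough proportions above), not the notebook's actual data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 62 up months (1) and 38 down months (0)
y_demo = np.array([1] * 62 + [0] * 38)
X_demo = np.zeros((len(y_demo), 1))  # the dummy model ignores the features

# "most_frequent" always predicts the majority class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)

baseline_acc = accuracy_score(y_demo, baseline.predict(X_demo))
print(baseline_acc)  # 0.62
```

Any model worth keeping should beat this number on the same test set.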

### Suggestions to try to improve the model

- increase maximum iterations
- stratify the testing and training data
- include percent change data for milk, wheat, and food CPI prices
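
To illustrate the stratification suggestion, the sketch below shows how `stratify` keeps the share of up months the same in the training and testing sets. The labels are hypothetical (124 up, 76 down), and `test_size` and `random_state` are assumptions added here so the example is reproducible; the notebook itself uses the defaults.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels with roughly the notebook's imbalance (62% up months)
y_demo = np.array([1] * 124 + [0] * 76)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo preserves the class proportions in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

print(f"Share of up months in train: {y_tr.mean():.2f}")
print(f"Share of up months in test:  {y_te.mean():.2f}")
```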

## Attempt 2:
This attempt will add percent change data and stratify the training and testing sets.

In [None]:
price_df

In [None]:
price_df["Wheat_Pct_Change"] = price_df["Wheat_Price"].pct_change()
price_df.head()

In [None]:
price_df["Milk_Pct_Change"] = price_df["Milk Cost per Gallon"].pct_change()
price_df.head()

In [None]:
price_df["CPI_Pct_Change"] = price_df["CPI_Price"].pct_change()
price_df.head()

In [None]:
# Remove rows with null values
price_nona_df = price_df.dropna()
price_nona_df

In [None]:
# Export above df as csv
# price_nona_df.to_csv("../../../Edited Data/Output/prices_and_pct_change.csv")

In [None]:
# Separate features from target

# The target is whether the DJIA went up or down
y = price_nona_df["DJIA_change"]

# Features are all other data
X = price_nona_df.drop(columns="DJIA_change")

In [None]:
# Split into training and testing sets
# Stratify data
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

In [None]:
# Create logistic regression model
classifier = LogisticRegression(solver='lbfgs', max_iter=200)

classifier.fit(X_train, y_train)

In [None]:
# Make predictions
y_pred = classifier.predict(X_test)
results_df = pd.DataFrame({"Prediction": y_pred, "Actual": y_test}).reset_index(drop=True)
results_df.head(20)

In [None]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

In [None]:
matrix = confusion_matrix(y_test, y_pred)
print(matrix)

In [None]:
report = classification_report(y_test, y_pred)
print(report)

### Notes:
These changes made a slight improvement. With the new percent change data and stratification of the training and testing sets, the same model received an accuracy score of about 66.3%, compared to the previous score of 65.2%.
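
Because each attempt draws a single random split without a fixed `random_state`, a difference of about one percentage point may reflect the split rather than the changes themselves. One way to check is to average accuracy over several stratified folds. The sketch below assumes synthetic data from `make_classification` as a stand-in for the price features; the fold count and seeds are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the price data: ~330 rows, imbalanced binary target
X_demo, y_demo = make_classification(
    n_samples=330, n_features=9, weights=[0.38, 0.62], random_state=0
)

# Five stratified folds, shuffled with a fixed seed for reproducibility
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = LogisticRegression(solver="lbfgs", max_iter=200)

scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

If the spread across folds is larger than the gap between two attempts, the gap is probably not meaningful.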

## Attempt 3:
Run the same model with the same data as in Attempt 2, but with more iterations.

In [None]:
# Create logistic regression model
# Double maximum iterations to 400
classifier = LogisticRegression(solver='lbfgs', max_iter=400)

classifier.fit(X_train, y_train)

In [None]:
# Make predictions
y_pred = classifier.predict(X_test)
results_df = pd.DataFrame({"Prediction": y_pred, "Actual": y_test}).reset_index(drop=True)
results_df.head(20)

In [None]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

In [None]:
matrix = confusion_matrix(y_test, y_pred)
print(matrix)

In [None]:
report = classification_report(y_test, y_pred)
print(report)

### Notes:
This made no difference.
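
A likely explanation is that the solver already converges well before the iteration limit, in which case raising `max_iter` cannot change the fitted model. The fitted estimator's `n_iter_` attribute reports how many iterations were actually used. The sketch below runs on synthetic data from `make_classification`, so the counts it prints are not the ones this notebook's data would give.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; the real price features may need more iterations
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

iterations_used = {}
for limit in (200, 400):
    clf = LogisticRegression(solver="lbfgs", max_iter=limit).fit(X_demo, y_demo)
    # n_iter_ holds the number of iterations the solver actually ran
    iterations_used[limit] = int(clf.n_iter_[0])
    print(f"max_iter={limit}: solver stopped after {iterations_used[limit]} iterations")
```

If `n_iter_` equals `max_iter`, the solver hit the limit and scikit-learn also emits a `ConvergenceWarning`.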

## Attempt 3b:
Decrease iterations from the original 200.

In [None]:
# Create logistic regression model
# Decrease max iterations to 150
classifier = LogisticRegression(solver='lbfgs', max_iter=150)

classifier.fit(X_train, y_train)

In [None]:
# Make predictions
y_pred = classifier.predict(X_test)
results_df = pd.DataFrame({"Prediction": y_pred, "Actual": y_test}).reset_index(drop=True)
results_df.head(20)

In [None]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

In [None]:
matrix = confusion_matrix(y_test, y_pred)
print(matrix)

In [None]:
report = classification_report(y_test, y_pred)
print(report)

### Notes:
No meaningful difference.

## Attempt 4:
Test the same model from Attempt 2 (the most successful so far) on data from January 2008 onward.

In [7]:
# Load the exported df and preview it
price_df = pd.read_csv("../../../Edited Data/Output/prices_and_pct_change.csv")
price_df.head()

Unnamed: 0,date_time,Beef $/LB,Beef_Pct_Change,Wheat_Price,CPI_Price,Milk Cost per Gallon,DJIA_change,Wheat_Pct_Change,Milk_Pct_Change,CPI_Pct_Change
0,1995-08-01,1.328,-0.027106,1.161,138.8,2.482,0,0.012206,0.002019,0.004342
1,1995-09-01,1.376,0.036145,1.159,139.5,2.459,1,-0.001723,-0.009267,0.005043
2,1995-10-01,1.371,-0.003634,1.175,140.6,2.473,0,0.013805,0.005693,0.007885
3,1995-11-01,1.368,-0.002188,1.169,141.0,2.493,1,-0.005106,0.008087,0.002845
4,1995-12-01,1.403,0.025585,1.154,141.5,2.518,1,-0.012831,0.010028,0.003546


In [8]:
price_df.dtypes

date_time                object
Beef $/LB               float64
Beef_Pct_Change         float64
Wheat_Price             float64
CPI_Price               float64
Milk Cost per Gallon    float64
DJIA_change               int64
Wheat_Pct_Change        float64
Milk_Pct_Change         float64
CPI_Pct_Change          float64
dtype: object

In [11]:
# Convert dates to datetime format
price_df["date_time"] = pd.to_datetime(price_df["date_time"], format="%Y-%m-%d")
price_df.dtypes

date_time               datetime64[ns]
Beef $/LB                      float64
Beef_Pct_Change                float64
Wheat_Price                    float64
CPI_Price                      float64
Milk Cost per Gallon           float64
DJIA_change                      int64
Wheat_Pct_Change               float64
Milk_Pct_Change                float64
CPI_Pct_Change                 float64
dtype: object

In [12]:
price_2008_df = price_df.loc[price_df["date_time"] >= "2008-01-01"]
price_2008_df.head()

Unnamed: 0,date_time,Beef $/LB,Beef_Pct_Change,Wheat_Price,CPI_Price,Milk Cost per Gallon,DJIA_change,Wheat_Pct_Change,Milk_Pct_Change,CPI_Pct_Change
149,2008-01-01,2.328,0.042077,1.812,199.793,3.871,0,0.001658,0.000258,0.005046
150,2008-02-01,2.381,0.022766,1.884,199.835,3.869,0,0.039735,-0.000517,0.00021
151,2008-03-01,2.293,-0.036959,1.873,200.059,3.781,0,-0.005839,-0.022745,0.001121
152,2008-04-01,2.323,0.013083,1.92,201.079,3.799,1,0.025093,0.004761,0.005098
153,2008-05-01,2.313,-0.004305,1.973,201.873,3.76,0,0.027604,-0.010266,0.003949


In [25]:
# Drop the date column; it is not used as a feature
price_2008_df = price_2008_df.drop(columns="date_time")
price_2008_df

Unnamed: 0,Beef $/LB,Beef_Pct_Change,Wheat_Price,CPI_Price,Milk Cost per Gallon,DJIA_change,Wheat_Pct_Change,Milk_Pct_Change,CPI_Pct_Change
149,2.328,0.042077,1.812,199.793,3.871,0,0.001658,0.000258,0.005046
150,2.381,0.022766,1.884,199.835,3.869,0,0.039735,-0.000517,0.000210
151,2.293,-0.036959,1.873,200.059,3.781,0,-0.005839,-0.022745,0.001121
152,2.323,0.013083,1.920,201.079,3.799,1,0.025093,0.004761,0.005098
153,2.313,-0.004305,1.973,201.873,3.760,0,0.027604,-0.010266,0.003949
...,...,...,...,...,...,...,...,...,...
325,4.862,-0.015191,2.362,318.374,4.181,0,0.027850,-0.003100,0.002964
326,4.836,-0.005348,2.386,319.917,4.184,1,0.010161,0.000718,0.004847
327,4.853,0.003515,2.419,320.034,4.218,1,0.013831,0.008126,0.000366
328,4.800,-0.010921,2.419,322.507,4.211,0,0.000000,-0.001660,0.007727


In [26]:
# Separate features from target

# The target is whether the DJIA went up or down
y = price_2008_df["DJIA_change"]

# Features are all other data
X = price_2008_df.drop(columns="DJIA_change")

In [27]:
# Split into training and testing sets
# Stratify data
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

In [28]:
# Create logistic regression model
classifier = LogisticRegression(solver='lbfgs', max_iter=200)

classifier.fit(X_train, y_train)

LogisticRegression(max_iter=200)

In [29]:
# Make predictions
y_pred = classifier.predict(X_test)
results_df = pd.DataFrame({"Prediction": y_pred, "Actual": y_test}).reset_index(drop=True)
results_df.head(20)

Unnamed: 0,Prediction,Actual
0,1,1
1,1,1
2,1,0
3,1,0
4,1,0
5,1,1
6,1,1
7,1,1
8,1,1
9,1,0


In [30]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

0.6521739130434783


In [31]:
from sklearn.metrics import confusion_matrix, classification_report

In [32]:
matrix = confusion_matrix(y_test, y_pred)
print(matrix)

[[ 1 16]
 [ 0 29]]


In [33]:
report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

           0       1.00      0.06      0.11        17
           1       0.64      1.00      0.78        29

    accuracy                           0.65        46
   macro avg       0.82      0.53      0.45        46
weighted avg       0.78      0.65      0.54        46
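
The figures in the report can be recomputed by hand from the confusion matrix printed above, which helps show what the 65% accuracy hides. The sketch below uses the matrix values from this run. The "always up" comparison line is an addition for illustration: it is the accuracy a model would get on this test set by predicting an up month every time.

```python
import numpy as np

# Confusion matrix from the run above: rows are actual classes, columns are predictions
matrix = np.array([[1, 16],
                   [0, 29]])

# For a binary problem, ravel() yields TN, FP, FN, TP (class 0 = down, class 1 = up)
tn, fp, fn, tp = matrix.ravel()

accuracy = (tn + tp) / matrix.sum()   # 30 / 46
recall_down = tn / (tn + fp)          # 1 / 17 down months caught
precision_up = tp / (tp + fp)         # 29 / 45 "up" predictions correct
always_up = (fn + tp) / matrix.sum()  # 29 / 46 if every prediction were "up"

print(f"{accuracy:.3f} {recall_down:.3f} {precision_up:.3f} {always_up:.3f}")
# 0.652 0.059 0.644 0.630
```

The model predicts "up" for 45 of the 46 test months, so it behaves almost the same as the always-up baseline.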

