# Part 1 - Initial Data Preparation

In [14]:
# Pulling in the dataset and looking at its values
import pandas as pd
dataframe = pd.read_csv("Online Retail.csv")


# Clean the data so that large outliers and negative values don't skew it.
# Limiting UnitPrice and Quantity to the 0-100 range also keeps values readable on the graphs.
dataframe = dataframe[dataframe['UnitPrice'] >= 0]
dataframe = dataframe[dataframe['UnitPrice'] <= 100]
dataframe = dataframe[dataframe['Quantity'] >= 0]
dataframe = dataframe[dataframe['Quantity'] <= 100]
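
The four range filters above could also be written as a single boolean mask. This is only a sketch on a made-up frame (`toy` and its rows are hypothetical; only the column names come from the notebook), and it is meant to keep the same rows as the chained version:

```python
import pandas as pd

# Hypothetical toy rows; only the column names match the notebook.
toy = pd.DataFrame({
    "UnitPrice": [2.5, -1.0, 150.0, 10.0],
    "Quantity": [3, 5, 2, 200],
})

# Series.between is inclusive on both ends by default,
# so this keeps rows where 0 <= value <= 100 for both columns.
mask = toy["UnitPrice"].between(0, 100) & toy["Quantity"].between(0, 100)
cleaned = toy[mask]
print(cleaned)
```

Building one mask avoids creating three intermediate frames, which can matter on a dataset of this size.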

In [15]:
# keep cleaning

# create new columns containing individual pieces from the InvoiceDate
dataframe["InvoiceDate"] = pd.to_datetime(dataframe["InvoiceDate"], format='%Y-%m-%d %H:%M:%S')
dataframe['trans_month'] = dataframe['InvoiceDate'].dt.month
dataframe['trans_week'] = dataframe['InvoiceDate'].dt.isocalendar().week
dataframe["trans_day"] = dataframe['InvoiceDate'].dt.dayofyear

# Create a total_cost feature for each row and keep only rows with a total cost of 2000 or less.
# This limit gets rid of outliers that would otherwise stretch the graph axes.
dataframe['total_cost'] = dataframe['Quantity'] * dataframe['UnitPrice']
dataframe = dataframe[dataframe['total_cost'] <= 2000]

# calculate the weekly total
# NOTE: this sums UnitPrice rather than total_cost, so it is the weekly sum of unit prices
weekly_total = []
for i in range(1, 53):
    weekly_total.append(dataframe[dataframe["trans_week"] == i]["UnitPrice"].sum())

# add the weekly total to each row
mapping_dict = dict(zip(range(1, 53), weekly_total))
dataframe["weekly_total"] = dataframe['trans_week'].map(mapping_dict)

# calculate weekly spending average
average_weekly_cost = []
for i in range(1, 53):
    average_weekly_cost.append(dataframe[dataframe["trans_week"] == i]["total_cost"].mean())

# add the average weekly total to each row
mapping_dict = dict(zip(range(1, 53), average_weekly_cost))
dataframe["average_weekly_total"] = dataframe['trans_week'].map(mapping_dict)

# calculate the daily average (NOTE: this averages UnitPrice, not total_cost)
average_daily_cost = []
for i in range(1, 366):
    average_daily_cost.append(dataframe[dataframe["trans_day"] == i]["UnitPrice"].mean())

# add the average daily total to each row
mapping_dict = dict(zip(range(1, 366), average_daily_cost))
dataframe["average_daily_total"] = dataframe['trans_day'].map(mapping_dict)
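
The loop-and-map pattern in this cell can also be expressed with `groupby(...).transform(...)`, which computes the per-week value and writes it back onto every row in one step. The sketch below uses a made-up frame (`toy` is hypothetical) and sums `total_cost` for the weekly total, which is an assumption about the intent; the loop above sums `UnitPrice`.

```python
import pandas as pd

# Hypothetical toy transactions; column names follow the notebook.
toy = pd.DataFrame({
    "trans_week": [1, 1, 2, 2, 2],
    "total_cost": [10.0, 20.0, 5.0, 15.0, 40.0],
})

by_week = toy.groupby("trans_week")["total_cost"]

# transform() returns one value per original row, aligned on the row index,
# so no intermediate list or mapping dictionary is needed.
toy["weekly_total"] = by_week.transform("sum")
toy["average_weekly_total"] = by_week.transform("mean")
print(toy)
```

The same idea would apply to the daily columns by grouping on `trans_day` instead.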

In [16]:
# Splitting the data
from sklearn.model_selection import train_test_split

# create both sets with a test size of 20 percent
train_set, test_set = train_test_split(dataframe, test_size=0.2, random_state=46)

# Part 2 - Picking Feature/s for X and the target feature y

For the features in X I am choosing "trans_week", "weekly_total", and "average_daily_total", as they should give good insight into how much a person may spend on a weekly basis. The target y is "average_weekly_total". By looking at the transaction week, the weekly total for these transactions, and the average daily spending, I think we can get some useful predictive power out of the model.
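
One quick way to sanity-check a feature choice like this is to look at how strongly each candidate correlates with the target. The sketch below uses made-up numbers (`toy` is hypothetical; only the column names come from the notebook). Pearson correlation only measures linear association, so a low value does not rule out a non-linear relationship.

```python
import pandas as pd

# Hypothetical toy values, not taken from the dataset.
toy = pd.DataFrame({
    "trans_week": [1, 2, 3, 4, 5],
    "weekly_total": [100.0, 120.0, 90.0, 150.0, 130.0],
    "average_weekly_total": [12.0, 14.0, 16.0, 18.0, 20.0],
})

features = ["trans_week", "weekly_total"]

# corrwith computes the Pearson correlation of each column with the target Series.
corr = toy[features].corrwith(toy["average_weekly_total"])
print(corr)
```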

# Part 3 - Doing Linear Regression

In [27]:
# Doing Linear Regression
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()

X = train_set[["trans_week", "weekly_total", "average_daily_total"]]
y = train_set["average_weekly_total"]

linreg.fit(X, y)

Ypred = linreg.predict(X)
print("R2 Score:", linreg.score(X, y))
# NOTE: mean_squared_error returns the MSE (squared units); its square root is the RMSE
print("RMSE:", mean_squared_error(y, Ypred))

R2 Score: 0.0736282081512738
RMSE: 2.1318226059088436


# Part 4 - Result Comments

### R2 result
```R2 Score: 0.0736282081512738```

Looking at the R2 score, it is definitely not great. We should be aiming for something closer to 1. While I think there is definitely a relationship here, I don't think it will be visible until some transformations are applied to the data. The values in weekly_total are far larger than the values in average_daily_total and average_weekly_total, and my first thought was that this spread causes the very low R2 score. Moving forward the data could still be scaled for easier comparison, but rescaling alone does not change the R2 of an ordinary least squares fit, so ultimately I don't think the relationship here is linear.
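
Whether feature scale affects the fit can be checked directly. The sketch below uses synthetic data (all values are made up) with two features on very different scales and compares the R2 of `LinearRegression` before and after `StandardScaler`. For ordinary least squares with an intercept, the two scores should agree up to floating-point error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features on very different scales (hypothetical data).
X = np.column_stack([
    rng.uniform(1, 52, 200),       # small values, like a week number
    rng.uniform(1e4, 1e5, 200),    # large values, like a weekly total
])
y = 0.1 * X[:, 0] + 1e-5 * X[:, 1] + rng.normal(0, 1, 200)

# Fit on the raw features.
r2_raw = LinearRegression().fit(X, y).score(X, y)

# Fit again after standardizing each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)
r2_scaled = LinearRegression().fit(X_scaled, y).score(X_scaled, y)

print(r2_raw, r2_scaled)
```

Scaling does matter for regularized or distance-based models, so it is still a reasonable preprocessing step if a different model is tried later.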


### RMSE Result
```RMSE: 2.1318226059088436```

The reported error of 2.13 is not great either. Note that `mean_squared_error` returns the mean squared error, so 2.13 is in squared units and the RMSE is its square root, about 1.46. That means the model's predictions typically deviate from the true value by about 1.46 units. If we were predicting very large values this would be acceptable, but since we are trying to predict average weekly spending, we want this value to be as small as possible.
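
scikit-learn's `mean_squared_error` returns the mean squared error, which is in squared units, so a square root is needed to express the error in the units of the target. The sketch below shows the difference in plain Python with made-up numbers (the helper names `mse` and `rmse` are hypothetical), then takes the square root of the value printed above.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals (squared units)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: back in the units of the target."""
    return math.sqrt(mse(y_true, y_pred))

# Made-up values, not taken from the dataset.
y_true = [14.0, 16.0, 18.0]
y_pred = [15.0, 16.0, 16.0]
toy_mse = mse(y_true, y_pred)
toy_rmse = rmse(y_true, y_pred)
print(toy_mse, toy_rmse)

# Square root of the value the notebook printed under the "RMSE" label.
reported = 2.1318226059088436
print(round(math.sqrt(reported), 2))
```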


# Part 5 - Trying to Improve

In [18]:
# Creating new features that may help improve the linear regression model

# grab the number of transactions per week and day
weekly_transactions = dataframe.groupby('trans_week').size()
daily_transactions = dataframe.groupby('trans_day').size()

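# create weekly and daily transaction columns and assign each row the count for its week or day
# NOTE: zip pairs the counts with 1..N by position, so any week or day with no transactions
# shifts later counts onto the wrong key; mapping with the grouped Series itself
# (e.g. .map(weekly_transactions)) would align on the actual week/day labels
mapping_dict = dict(zip(range(1, 53), weekly_transactions))
dataframe["weekly_transactions"] = dataframe['trans_week'].map(mapping_dict)

mapping_dict = dict(zip(range(1, 366), daily_transactions))
dataframe["daily_transactions"] = dataframe['trans_day'].map(mapping_dict)

# drop the rows that have no daily_transactions value so the regression can run
# (read_csv already loads missing values as NaN, so no string replacement is needed)
dataframe = dataframe.dropna(subset=['daily_transactions'])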

#split again with newly added features
from sklearn.model_selection import train_test_split

# create both sets with a test size of 20 percent
second_train_set, second_test_set = train_test_split(dataframe, test_size=0.2, random_state=46)
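
A note on the mapping step in this cell: `groupby(...).size()` only contains entries for the days (or weeks) that actually appear in the data, so zipping it against a full `range` can pair a count with the wrong key and leave later keys unmapped. The sketch below reproduces that on a made-up frame (`toy` is hypothetical) and compares it with mapping through the grouped Series directly, which aligns on the real labels.

```python
import pandas as pd

# Hypothetical toy data: day 2 has no transactions.
toy = pd.DataFrame({"trans_day": [1, 1, 3, 3, 3]})
counts = toy.groupby("trans_day").size()   # index is [1, 3], values are [2, 3]

# Positional zip pairs the second group (day 3) with key 2,
# so day 3 ends up with no entry in the dictionary.
zipped = dict(zip(range(1, 4), counts))
toy["via_zip"] = toy["trans_day"].map(zipped)

# Mapping with the Series itself looks each day up by its actual label.
toy["via_index"] = toy["trans_day"].map(counts)
print(toy)
```

In this toy case the zipped version leaves the day 3 rows as NaN, which is the kind of gap the `dropna` call above then removes.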

In [31]:
# Running the regression again on the new features
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()

X = second_train_set[["daily_transactions", "weekly_transactions", "average_daily_total"]]
y = second_train_set["average_weekly_total"]

linreg.fit(X, y)
Ypred = linreg.predict(X)

print("R2 Score:", linreg.score(X, y))
# NOTE: mean_squared_error returns the MSE (squared units); its square root is the RMSE
print("RMSE:", mean_squared_error(y, Ypred))

R2 Score: 0.04650859915571137
RMSE: 2.6838720458266185


## Improvement Attempt Comments

Original Scores
```
R2 Score: 0.0736282081512738
RMSE: 2.1318226059088436
```

New Scores
```
R2 Score: 0.04650859915571137
RMSE: 2.6838720458266185
```

Adding the weekly and daily transactions to the model did not improve its overall usefulness. It actually dropped the R2 value even lower and added roughly 0.55 to the reported error. After further analysis of the newly created features, it can be seen that the daily_transactions and weekly_transactions values are much larger than those of average_daily_total. I think this is causing a big spread in the data and ultimately bad scores for this model, although feature scale by itself does not affect a least squares fit. Overall, further analysis of the data will be needed, and potentially a different model other than a linear one.

---

# Part 6 - Final Test Set Evaluation

Since I got marginally better results with the first set of features, I am going to stick with that one when evaluating on the test set.

In [25]:
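from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# linreg was last fit on the Part 5 features, so refit the Part 3 model on the training set
feature_cols = ["trans_week", "weekly_total", "average_daily_total"]
linreg = LinearRegression().fit(train_set[feature_cols], train_set["average_weekly_total"])

X_eval = test_set[feature_cols]
y_true = test_set["average_weekly_total"]
y_pred = linreg.predict(X_eval)

print(y_pred)

r2 = linreg.score(X_eval, y_true)
# NOTE: mean_squared_error returns the MSE (squared units); its square root is the RMSE
rms_error = mean_squared_error(y_true, y_pred)
print("R2:", r2)
print("rms:", rms_error)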

[16.14587145 16.02963029 16.10768012 ... 15.9319921  16.20004547
 16.29532055]
R2: 0.07267753822610223
rms: 2.129693450703618


# Final Evaluation Discussion

After running the model on the **test_set** I got essentially the same results. The R2 value changed by ~0.001 and the reported error changed by less than 0.01. I feel that the inputs all provide good insight into the value being predicted, but I don't think it's a linear correlation. It's easy to see in my initial exploration that there is a non-linear relationship between the weeks of the year and the amounts of spending and total transactions. I think I could capture that here as well by properly scaling the values and fitting a non-linear function to the data.

In my understanding of this data, average_daily_total and weekly_total are important indicators of how much spending could occur depending on the week and day of the year. The value being predicted is exactly what the name implies: the average spending amount for each week. This value should have some relation (it seems to be non-linear) to the values being used to predict it. The predicted values do seem to fall in the range of the actual averages (around 13 - 14), which is consistent with the reported error of about 2.13. From my exploration in this notebook and the previous one, I believe the relationship is relatively constant until we reach the end of the year, where it rises sharply and then quickly falls back down.
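
To illustrate what fitting a non-linear function could look like, the sketch below compares a straight-line fit with a polynomial fit on a synthetic curve that stays flat for most of the year and bumps up near the end. The curve, the degree of 4, and the variable names are all assumptions chosen for illustration; this is not a fit to the retail data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic weekly averages: flat most of the year with a bump around week 47.
weeks = np.arange(1, 53).reshape(-1, 1)
avg_spend = 14.0 + 4.0 * np.exp(-((weeks.ravel() - 47) ** 2) / 18.0)

# Plain linear regression on the week number.
linear = LinearRegression().fit(weeks, avg_spend)

# Same regression on polynomial terms of the standardized week number.
poly = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=4, include_bias=False),
    LinearRegression(),
).fit(weeks, avg_spend)

r2_linear = linear.score(weeks, avg_spend)
r2_poly = poly.score(weeks, avg_spend)
print(r2_linear, r2_poly)
```

Both scores here are computed on the data used for fitting, so a real comparison would still need a held-out set like the test_set used above.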