# Homework 1
Homework is an individual assessment; you should not work in groups.

You will be turning in:

1. A [README.md](https://github.com/cmparlettpelleriti/CPSC392ParlettPelleriti/blob/master/Admin/READMEexample.md) with all the relevant information
2. An .ipynb with just your code (show all code necessary for the analysis, but remove superfluous code)
3. A PDF with your Report (rendered via Quarto)


## Data
We're going to be using some [clothing store data](https://raw.githubusercontent.com/cmparlettpelleriti/CPSC392ParlettPelleriti/master/Data/boutique.csv) to help the company predict how much their customers will spend with them per year.

- `gender`: self-disclosed gender identity, `male`, `female`, `nonbinary` or `other`
- `age`: current age of customer
- `height_cm`: self-reported height converted to centimeters
- `waist_size_cm`: self-reported waist size converted to centimeters
- `inseam_cm`: self-reported inseam (measurement from crotch of pants to floor) converted to centimeters
- `test_group`: whether or not the customer is in an experimental test group that gets special coupons once a month. `0` for no, `1` for yes.
- `salary_self_report_in_k`: self-reported salary of customer, in thousands
- `months_active`: number of months customer has been part of the clothing store's preferred rewards program
- `num_purchases`: the number of purchases the customer has made (a purchase is a single transaction that could include multiple items)
- `amount_spent_annual`: the average amount the customer has spent at the store per year


## 1. Modeling
- Drop Missing Values and Reset Indices if needed.
- Using *Train-Test-Split Model Validation* with an 80/20 split and `sklearn` `Pipeline`s, build **two** models that predict the average amount the customer spends in a year using all the other variables.
- Z-score continuous/interval variables, and One Hot Encode categorical variables (when needed) before fitting your models.
- Try both a typical **Linear Regression** model and a **Polynomial Regression** model (using `PolynomialFeatures()`)
- Once each model is trained, calculate the *MSE, MAE, MAPE, and $R^2$* for both the training and testing sets of both models.
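
As a reference for the four metrics named above, here is a minimal sketch that computes each one by hand with NumPy and checks it against `sklearn.metrics`. The arrays `y_true` and `y_pred` are made up for illustration and are not taken from the boutique data.

```python
import numpy as np
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             mean_absolute_percentage_error, r2_score)

# made-up values for illustration only (NOT from the boutique data)
y_true = np.array([100.0, 200.0, 400.0, 500.0])
y_pred = np.array([110.0, 190.0, 380.0, 520.0])

resid = y_true - y_pred

# mean squared error: average squared residual
mse = np.mean(resid ** 2)
# mean absolute error: average absolute residual
mae = np.mean(np.abs(resid))
# mean absolute percentage error: average absolute residual relative to the true value
mape = np.mean(np.abs(resid) / np.abs(y_true))
# R^2: 1 minus (residual sum of squares / total sum of squares)
r2 = 1 - np.sum(resid ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# the hand-computed values should agree with sklearn's implementations
assert np.isclose(mse, mean_squared_error(y_true, y_pred))
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(mape, mean_absolute_percentage_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))
```

Note that `mean_absolute_percentage_error` returns a fraction rather than a percentage, so a value of `0.125` corresponds to an average error of 12.5%.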

## 2. Graphs
Choose 2 of the following questions to answer. Build at *least* one ggplot to answer each question (you can also do other calculations in addition), and write a detailed answer based on the graph and calculations in your report (below). You do not NEED to use your model for these questions; they can be purely descriptive.

- Does being in the experimental `test_group` actually increase the amount a customer spends at the store? Is this relationship different for the different genders?
- Does making more money (salary) tend to increase the number of purchases someone makes? Does it increase the total amount spent?
- In which year did the store's *customers* make the most money? Were the store's sales highest in that year?
- People who are not your "average" size often find it difficult to buy clothes in traditional stores. Is there a relationship between inseam and amount spent in the store annually? Is there a relationship between height and amount spent in the store annually?
- In this dataset, is there a relationship between salary and height? Is it different for the different genders?
- The store is interested in whether their customer base has changed over time. Present the minimum, maximum, and average height, waist size, and inseam for each year.
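
For the per-year summary in the last question, one possible approach is a pandas `groupby` with several aggregations at once. This is only a sketch on a tiny made-up frame; it assumes the `year` column that the solution code uses, and the values are invented for illustration.

```python
import pandas as pd

# tiny made-up frame that reuses the boutique column names (values are invented)
toy = pd.DataFrame({
    "year": [2020, 2020, 2021, 2021],
    "height_cm": [160.0, 180.0, 170.0, 175.0],
    "waist_size_cm": [70.0, 90.0, 80.0, 85.0],
    "inseam_cm": [75.0, 85.0, 78.0, 82.0],
})

size_cols = ["height_cm", "waist_size_cm", "inseam_cm"]

# one row per year, with (column, statistic) pairs as the column index
summary = toy.groupby("year")[size_cols].agg(["min", "max", "mean"])
print(summary)
```

The result has a two-level column index, so a single value can be read with, for example, `summary.loc[2020, ("height_cm", "mean")]`.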


For all ggplots, make sure the data viz is effective, clear, and free of distracting elements. Graphs will be graded on both correctness (did you plot the right thing?) and effectiveness (does the graph thoughtfully demonstrate the principles we learned in our data viz lectures?).

## 3. Report

[TEMPLATE HERE](https://github.com/cmparlettpelleriti/CPSC392ParlettPelleriti/blob/master/Homework/HomeworkTemplate.qmd)

Your Technical Report is a way to practice presenting and formatting your results as you would in industry. Make sure your report is clear and easy to follow. Using Quarto ([download](https://quarto.org/docs/download/), [getting started](https://quarto.org/docs/get-started/)), write a report that has the following sections:

1. **Introduction**: description of the problem (e.g. what are you predicting? What variables do you have available? How might this model be useful if you are successful?). You should end with a sentence or two about what the impact of these models could be.

2. **Methods**: describe your models in detail (as if explaining them to the store's CEO), as well as any pre-processing you had to do to the data.

3. **Results**: How well did your model perform according to the various metrics? Was the model overfit (how can you tell)? What do those performance metrics tell you about the model? Did you need `PolynomialFeatures` (which includes both polynomial features and interactions)? How much do you trust the results of your model (in other words, would you be confident telling the store that they should use the model? Why or why not? Are there any caveats you'd give them?) Also answer the two questions you chose from part 2 above. Include the image, a caption, and your written answer.

4. **Discussion/Reflection**: A few sentences about what you learned from performing these analyses, and at least one suggestion for what you'd add or do differently if you were to perform this analysis again in the future.

In [None]:
import warnings
warnings.filterwarnings('ignore')

# data imports
import pandas as pd
import numpy as np
from plotnine import *

# modeling imports
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV # Linear Regression Model
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, SplineTransformer, OneHotEncoder #Z-score variables, Polynomial
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_percentage_error, mean_absolute_error #model evaluation
from sklearn.model_selection import train_test_split

# pipeline imports
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.compose import make_column_transformer

%matplotlib inline

In [4]:
df = pd.read_csv("https://raw.githubusercontent.com/cmparlettpelleriti/CPSC392ParlettPelleriti/master/Data/boutique.csv")
df.to_csv("boutique.csv")

df.dropna(inplace = True)
df.reset_index(inplace = True, drop = True)

df['year'] = df['year'].astype('category')
df['gender'] = df['gender'].astype('category')

predictors = ['year', 'gender', 'age', 'height_cm', 'waist_size_cm', 'inseam_cm', 'test_group', 'salary_self_report_in_k', 'months_active', 'num_purchases']
# test_group is a 0/1 indicator, so it is not z-scored; it passes through via remainder = "passthrough"
contin = ['age', 'height_cm', 'waist_size_cm', 'inseam_cm', 'salary_self_report_in_k', 'months_active', 'num_purchases']
X = df[predictors]
y = df['amount_spent_annual']

preprocess = make_column_transformer((StandardScaler(), contin),
                            (OneHotEncoder(), ['year']),
                            (OneHotEncoder(), ['gender']),
                            remainder = "passthrough")

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2)

lr = LinearRegression()

pipe = Pipeline([("pre", preprocess),
                ("linearregression", lr)])

pipe.fit(X_train,y_train)

# predict
y_pred_train = pipe.predict(X_train)
y_pred_test = pipe.predict(X_test)

# assess on the held-out test set
print("MSE: ", mean_squared_error(y_test, y_pred_test))
print("MAE: ", mean_absolute_error(y_test, y_pred_test))
print("MAPE: ", mean_absolute_percentage_error(y_test, y_pred_test))
print("R^2: ", r2_score(y_test, y_pred_test))

MSE:  12567.073484252625
MAE:  88.93358296326201
MAPE:  0.12522964676277779
R^2:  0.5280557224993601


In [7]:
z = make_column_transformer((StandardScaler(), contin),
                            (OneHotEncoder(), ['year']),
                            (OneHotEncoder(), ['gender']),
                            remainder = "passthrough")

lr2 = LinearRegression()

pipe2 = Pipeline([("zscore", z),
                ("poly", PolynomialFeatures(degree = 2)),
                ("linearregression", lr2)])

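# NOTE: this new split is not used below; the model is fit and scored on the full data
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2)

pipe2.fit(X,y)

# predict (in-sample: the same rows the model was fit on)
y_pred = pipe2.predict(X)

# assess (in-sample metrics, so not directly comparable to the test-set metrics of the linear model)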
print("MSE : ", mean_squared_error(y, y_pred))
print("MAE : ", mean_absolute_error(y, y_pred))
print("MAPE: ", mean_absolute_percentage_error(y, y_pred))
print("R2  : ", r2_score(y, y_pred))

MSE :  3065.168428655554
MAE :  44.20537017503129
MAPE:  0.05917063112271748
R2  :  0.8870995391841576
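
To compare the two models the way the assignment describes, both should be fit on the same training split and scored on both the training and testing sets. The following is a minimal sketch under stated assumptions: it uses synthetic stand-in data (not the boutique data), a reduced set of columns, a fixed `random_state`, and a hypothetical helper `make_preprocess`, so its numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

# synthetic stand-in data (NOT the boutique data): spending is quadratic in salary
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "age": rng.uniform(18, 70, n),
    "salary_self_report_in_k": rng.uniform(20, 150, n),
    "gender": rng.choice(["male", "female", "nonbinary", "other"], n),
})
toy["amount_spent_annual"] = (
    200 + 0.05 * toy["salary_self_report_in_k"] ** 2 + rng.normal(0, 20, n)
)

X = toy.drop(columns="amount_spent_annual")
y = toy["amount_spent_annual"]

# one split, shared by both models, so their metrics are comparable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

def make_preprocess():
    """Hypothetical helper: z-score numeric columns, one-hot encode gender."""
    return make_column_transformer(
        (StandardScaler(), ["age", "salary_self_report_in_k"]),
        (OneHotEncoder(handle_unknown="ignore"), ["gender"]),
    )

models = {
    "linear": Pipeline([("pre", make_preprocess()),
                        ("lr", LinearRegression())]),
    "poly": Pipeline([("pre", make_preprocess()),
                      ("poly", PolynomialFeatures(degree=2, include_bias=False)),
                      ("lr", LinearRegression())]),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)  # fit on the training rows only
    splits = {"train": (X_train, y_train), "test": (X_test, y_test)}
    for split, (Xs, ys) in splits.items():
        pred = model.predict(Xs)
        results[(name, split)] = {"MSE": mean_squared_error(ys, pred),
                                  "R2": r2_score(ys, pred)}

# one row per (model, split) pair
print(pd.DataFrame(results).T)
```

A training score that is much better than the testing score for the same model is the usual sign of overfitting; scoring only on the rows used for fitting cannot reveal that gap.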
