## Purpose

This homework is designed to give you practice with scikit-learn.  Please note that this is **NOT** a machine learning course.  Using the library is the important part, not designing 'good' models.  The requirements for this assignment are fairly low.

## Requirements

This is a group assignment.  Take a data set (either one provided, or your group project data set) and use scikit-learn to train a model on some aspect of it.

Some data sets may not be something you would use ML on in a 'real life' situation, but again, this is just for practice.  The models may not come out useful, and that's okay.

Each student in the group should do 2 ML implementations using scikit-learn.  Since there are likely fewer applicable algorithms than there are required implementations, look at different slices of the data (see the help video).
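As a rough sketch of what "different slices" could look like, the same estimator can be fitted on different subsets of feature columns and the results compared. This is one possible reading, not the definition from the help video. The data below is synthetic, and the column names (`feature_a`, `feature_b`, `target`) are hypothetical stand-ins for your own data set.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the column names are hypothetical.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature_a": rng.normal(size=200),
    "feature_b": rng.normal(size=200),
})
df["target"] = (
    3.0 * df["feature_a"] - 2.0 * df["feature_b"] + rng.normal(scale=0.1, size=200)
)

# Each "slice" is a different subset of feature columns.
slices = {
    "a only": ["feature_a"],
    "b only": ["feature_b"],
    "a and b": ["feature_a", "feature_b"],
}

scores = {}
for name, columns in slices.items():
    X_train, X_test, y_train, y_test = train_test_split(
        df[columns], df["target"], test_size=0.25, random_state=1
    )
    model = LinearRegression().fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # R^2 on the held-out rows

for name, score in scores.items():
    print(f"{name}: R^2 = {score:.2f}")
```

Each entry in `slices` can count as a separate implementation to compare, even though the algorithm is the same.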


## Required Hand-in

One notebook should be handed in, following the best practices I've outlined.  This homework is graded as a group homework.  The data set you pick for this practice can be either one I'm providing as part of the repo or your group project data set.

Please label each implementation with its original author (in the code, as a comment above the implementation).

Do not use the .todo as your template.  Analysis of the model's performance should be minimal (see the example in block 10 of https://github.com/TheDarkTrumpet/BAIS-6040-0EXP-Sum2021/blob/master/Notebooks/02-Analysis/09.03.01-Classification.ipynb ).

I do recommend that you lean on whoever in your group has a bit more knowledge of ML concepts to pick the implementation that appears to yield the best results.  If you're using your group data set, this implementation can then be copied/pasted into the group project.

## Other notes

This homework will be graded as a group, meaning you will all get the same grade, regardless of whether a specific student's implementation is poorly done.  It counts for 75 points.  I strongly recommend you discuss as a group who will do what, then meet up at least a few days before the assignment is due to do a code review and merge the individual notebooks.

# Imports

In [1]:
import yfinance as yf
import pandas as pd
from pytrends.request import TrendReq
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import random as rnd
import math

from sklearn.metrics import explained_variance_score, mean_absolute_error, r2_score, mean_squared_error

rnd.seed(1024)

# Set up the keyword search and pull the search results using pytrends for January 1 - June 30

In [2]:
#pytrends = TrendReq(hl='en-US', tz=360)
#keywords = ["AMC"] 
#pytrends.build_payload(keywords, timeframe='2021-01-01 2021-06-30', geo='US')
#amcSearchResults = pytrends.interest_over_time()

#amcSearchResults = amcSearchResults.rename(columns={'AMC': 'Search Interest'})
#amcSearchResults
amcMergedDataFrame = pd.read_csv('AMCDataClean.zip')  

# Use Yahoo Finance to pull Stock Data from January 1 - June 30

In [3]:
#amcStockInfo = yf.download("AMC", start="2021-01-01", end="2021-06-30", interval="1d")
#amcStockInfo.shape
#amcStockInfo["Amount Changed"] = amcStockInfo["Open"] - amcStockInfo["Close"]
#amcMergedDataFrame
#amcMergedDataFrame = amcMergedDataFrame.drop(columns=['Adj Close'])
#amcMergedDataFrame["Days Spread"] = amcMergedDataFrame["High"] - amcMergedDataFrame["Low"]
#amcMergedDataFrame.to_csv('AMCDataClean.zip', index=False) 
amcMergedDataFrame

     Search Interest       Open      Close     Volume  Amount Changed  Days Spread
0                  2   2.200000   2.010000   29873800        0.190000     0.200000
1                  3   1.990000   1.980000   28148300        0.010000     0.120000
2                  2   2.030000   2.010000   67363300        0.020000     0.260000
3                  2   2.080000   2.050000   26150500        0.030000     0.090000
4                  3   2.090000   2.140000   39553300       -0.050000     0.140000
...              ...        ...        ...        ...             ...          ...
118               17  57.040001  58.299999  116291800       -1.259998     4.299999
119               16  57.980000  56.700001   80351200        1.279999     3.099998
120               19  55.750000  54.060001   77596900        1.689999     3.320000
121               16  55.099998  58.110001   99310200       -3.010002     5.029999


# Merge the data into one table

In [4]:
#amcMergedDataFrame = amcSearchResults.merge(amcStockInfo, how='inner', left_index=True, right_index=True)
#amcMergedDataFrame


#amcMergedDataFrame = pd.read_csv('AMCData.zip')
#amcMergedDataFrame = amcMergedDataFrame.drop(columns=['isPartial'])
#amcMergedDataFrame
#amcMergedDataFrame.to_csv('AMCDataClean.zip', index=False) 

featureColumns=['Search Interest', 'Open']
target = 'Close'

X=amcMergedDataFrame[featureColumns]
y=amcMergedDataFrame[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
#print("Population:\n",y.value_counts(normalize=True)*100)
#print("Train:\n", y_train.value_counts(normalize=True)*100)
#print("Test:\n", y_test.value_counts(normalize=True)*100)
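To see what `test_size=0.25` and `random_state=1` do in the split above, here is a minimal sketch on stand-in data. The row count of 122 is an assumption inferred from the displayed index (0 to 121), and the column values are placeholders rather than the real AMC data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

n_rows = 122  # assumed row count, inferred from the displayed index 0..121

# Placeholder values; only the shape matters for this illustration.
X = pd.DataFrame({
    "Search Interest": np.arange(n_rows),
    "Open": np.arange(n_rows, dtype=float),
})
y = pd.Series(np.arange(n_rows, dtype=float), name="Close")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)
print(f"train rows: {len(X_train)}, test rows: {len(X_test)}")

# A fixed random_state makes the shuffle reproducible:
# splitting again with the same seed selects the same rows.
X_train_again, _, _, _ = train_test_split(X, y, test_size=0.25, random_state=1)
same_split = X_train.index.equals(X_train_again.index)
print(f"same rows selected with the same seed: {same_split}")
```

Fixing `random_state` is what lets group members reproduce each other's scores when the notebooks are merged.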

In [5]:
lr = LinearRegression()
lr

LinearRegression()

In [6]:
lr.fit(X_train, y_train)

LinearRegression()

In [7]:
lr.score(X_train, y_train) 

0.9691667374025036

In [8]:
lr.score(X_test, y_test) 

0.9709538923793088
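For a regressor such as `LinearRegression`, `score` returns the coefficient of determination (R²). The sketch below checks this on synthetic data (not the AMC data set) by comparing `score` with `r2_score` and with a by-hand computation; the coefficients used to generate the data are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data with two features; the generating coefficients are arbitrary.
rng = np.random.default_rng(1024)
X = rng.uniform(0, 60, size=(100, 2))
y = 0.05 * X[:, 0] + 0.9 * X[:, 1] + rng.normal(scale=2.0, size=100)

model = LinearRegression().fit(X, y)
predictions = model.predict(X)

score_from_model = model.score(X, y)
score_from_metric = r2_score(y, predictions)

# R^2 by hand: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y - predictions) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
score_by_hand = 1 - ss_res / ss_tot

print(score_from_model, score_from_metric, score_by_hand)
```

All three values should agree up to floating-point error, so the train and test scores above can be read as R² values.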

In [9]:
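def printMetrics(test, predictions):
    """Print regression metrics comparing the true values (test) with predictions."""
    # "Score" is the explained variance; lr.score() reports R^2 instead
    print(f"Score: {explained_variance_score(test, predictions):.2f}")
    print(f"MAE: {mean_absolute_error(test, predictions):.2f}")
    print(f"RMSE: {math.sqrt(mean_squared_error(test, predictions)):.2f}")
    print(f"r2: {r2_score(test, predictions):.2f}")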

In [10]:
predictions = lr.predict(X_test)
printMetrics(y_test, predictions)

Score: 0.97
MAE: 1.91
RMSE: 3.13
r2: 0.97
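The explained variance ("Score") and r2 agree here, which is what happens when the prediction errors average out to roughly zero. A small sketch with made-up numbers shows how the two metrics diverge once the predictions carry a constant bias; none of these values come from the AMC data.

```python
import math

import numpy as np
from sklearn.metrics import (
    explained_variance_score,
    mean_absolute_error,
    mean_squared_error,
    r2_score,
)

# Made-up values for illustration only.
y_true = np.array([2.0, 4.0, 6.0, 8.0])
y_unbiased = np.array([2.5, 3.5, 6.5, 7.5])  # errors cancel out on average
y_biased = y_unbiased + 1.0                  # every prediction is 1.0 too high

results = {}
for label, y_pred in [("unbiased", y_unbiased), ("biased", y_biased)]:
    results[label] = {
        "explained_variance": explained_variance_score(y_true, y_pred),
        "r2": r2_score(y_true, y_pred),
        "mae": mean_absolute_error(y_true, y_pred),
        "rmse": math.sqrt(mean_squared_error(y_true, y_pred)),
    }
    print(label, results[label])
```

Explained variance ignores a constant offset in the predictions, while r2, MAE, and RMSE all penalize it, so reporting both columns is a cheap check for systematic bias.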


# Predict some new examples

In [11]:
numElements = 3
sampleStockTrend = []
for _ in range(numElements):
    sample = {}  # named "sample" to avoid shadowing the built-in dict
    for column in X.columns:
        minValue = 0  # assume min = 0; avoids shadowing the built-in min
        maxValue = round(max(amcMergedDataFrame[column].values))
        sample[column] = rnd.randint(minValue, maxValue)
    sampleStockTrend.append(sample)
sampleStockTrend

[{'Search Interest': 2, 'Open': 30},
 {'Search Interest': 49, 'Open': 20},
 {'Search Interest': 66, 'Open': 6}]

In [12]:
pdSampleStockTrend = pd.DataFrame.from_dict(sampleStockTrend)
pdSampleStockTrend

   Search Interest  Open
0                2    30
1               49    20
2               66     6


In [13]:
predictions = lr.predict(pdSampleStockTrend)
predictions

array([29.84559392, 22.07508896,  8.74029546])

In [15]:
pdPredictedStockTrend = pdSampleStockTrend.copy()
pdPredictedStockTrend['Price Prediction'] = predictions
pdPredictedStockTrend

   Search Interest  Open  Price Prediction
0                2    30         29.845594
1               49    20         22.075089
2               66     6          8.740295
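The predicted prices stay close to the `Open` values, which suggests (but does not prove) that the fitted model leans mostly on `Open`. One way to check is to inspect `coef_` and `intercept_` on the fitted model. The sketch below does this on synthetic data with the same column names; the generating weights (0.03 and 0.95) are assumptions chosen for illustration, not the coefficients of the model above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data using the notebook's column names.
rng = np.random.default_rng(7)
frame = pd.DataFrame({
    "Search Interest": rng.integers(0, 100, size=150),
    "Open": rng.uniform(2, 60, size=150),
})
# Assumed relationship for illustration: Close depends mostly on Open.
close = (
    0.03 * frame["Search Interest"]
    + 0.95 * frame["Open"]
    + rng.normal(scale=1.0, size=150)
)

model = LinearRegression().fit(frame, close)

# One coefficient per feature column, in the order the columns were fitted.
coefficients = pd.Series(model.coef_, index=frame.columns)
print(coefficients)
print("intercept:", model.intercept_)

# A linear model's prediction is intercept + sum(coefficient * feature value).
sample = pd.DataFrame([{"Search Interest": 2, "Open": 30}])
by_hand = model.intercept_ + (sample.iloc[0] * coefficients).sum()
from_model = model.predict(sample)[0]
print(by_hand, from_model)
```

Running the same two `print` calls against `lr` in this notebook would show how much weight each of `Search Interest` and `Open` actually carries.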
