# Assignment Prep - K Means

For this part of the assignment, we will use Yahoo Finance data. The `yfinance` package will construct our dataset.

- You need to pick 5 stocks (Tickers). Reference - https://stockanalysis.com/stocks/

- In the 2nd cell, replace the text "YOUR TICKERS GO HERE" with comma-separated tickers, e.g. ["AAPL","MSFT","SPY","KO","AMZN"].

- Do not use the same stocks outlined in the above example.

- The 1st cell contains installation commands for specific package versions. Do not change them; otherwise the dataset generation code will not work.

- Once you have entered your tickers, run the cell. The .csv files will then be visible in the folders tab on the left side of your Colab screen.

In [None]:
!pip install yfinance==0.1.62
!pip install pandas==1.3.5

In [None]:
import os
import pandas as pd
import yfinance as yf

def get_stock(ticker, start_date, end_date):
    try:
        df = yf.download(ticker, start=start_date, end=end_date)
        df['Ticker'] = ticker
        # Daily simple return; the first row has no previous close, so it is set to 0.
        df['Return'] = df['Adj Close'].pct_change()
        df['Return'] = df['Return'].fillna(0)
        df['Date'] = df.index
        df['Date'] = pd.to_datetime(df['Date'])
        df['Month'] = df['Date'].dt.month
        df['Year'] = df['Date'].dt.year
        df['Day'] = df['Date'].dt.day
        for col in ['Open', 'High', 'Low', 'Close', 'Adj Close']:
            df[col] = df[col].round(2)
        df['Weekday'] = df['Date'].dt.day_name()
        df['Week_Number'] = df['Date'].dt.strftime('%U')
        df['Year_Week'] = df['Date'].dt.strftime('%Y-%U')
        col_list = ['Ticker','Date', 'Year', 'Month', 'Day', 'Weekday',
                    'Week_Number', 'Year_Week', 'Open',
                    'High', 'Low', 'Close', 'Volume', 'Adj Close',
                    'Return']
        df = df[col_list]
        return df
    except Exception as error:
        print(error)
        return None

ticker=["YOUR TICKERS GO HERE"] # ["AAPL","MSFT","SPY","KO","AMZN"]
for stock in ticker:
    try:
        input_dir = os.getcwd()
        output_file = os.path.join(input_dir, stock + '.csv')
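        df = get_stock(stock, start_date='2022-01-01', end_date='2022-12-31')
        # get_stock returns None on failure; stop here instead of hitting an AttributeError.
        if df is None or df.empty:
            raise ValueError('no data returned for ' + stock)
        df.to_csv(output_file, index=False)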
        print('wrote ' + str(len(df)) + ' lines to file: ' + output_file)
    except Exception as e:
        print(e)
        print('Failed to fetch data for', stock)

### This cell is for installing any python packages you want to use

In [None]:
!pip install your-package-name

# Question 1 - Feature Engineering

Merge all 5 datasets.

Create four new features - **(9 Points)**

- Mean Return
- Volatility (Standard Deviation)
- High Low Spread (High price - Low price)
- Average Trading volume

You will need to create a new dataset in which each week for each ticker is represented by a single row. This dataset should have the columns Ticker, Week_Number, Mean_Return, Volatility, High_Low_Spread, and Average_Trading_Volume. For example, the Mean_Return for a given stock and week is the mean of that stock's daily returns over all the trading days in that week.

*Hint: For each ticker, there should be around 52 data points in the new dataset.*
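
The weekly aggregation described above can be sketched with a pandas `groupby`. This is only an illustration on synthetic data: the tickers `AAA` and `BBB` are hypothetical, and taking the spread as the weekly mean of the daily `High - Low` is one possible interpretation, not the required one.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the merged daily data (two hypothetical tickers, ten trading days).
rng = np.random.default_rng(0)
dates = pd.bdate_range("2022-01-03", periods=10)
frames = []
for ticker in ["AAA", "BBB"]:
    frames.append(pd.DataFrame({
        "Ticker": ticker,
        "Week_Number": dates.strftime("%U"),
        "Return": rng.normal(0, 0.01, len(dates)),
        "High": rng.uniform(101, 105, len(dates)),
        "Low": rng.uniform(95, 100, len(dates)),
        "Volume": rng.integers(1_000_000, 5_000_000, len(dates)),
    }))
daily = pd.concat(frames, ignore_index=True)

# Assumption: the spread is measured per day and then averaged within the week.
daily["Spread"] = daily["High"] - daily["Low"]

# One row per (Ticker, Week_Number) with the four engineered features.
weekly = (
    daily.groupby(["Ticker", "Week_Number"], as_index=False)
    .agg(
        Mean_Return=("Return", "mean"),
        Volatility=("Return", "std"),
        High_Low_Spread=("Spread", "mean"),
        Average_Trading_Volume=("Volume", "mean"),
    )
)
print(weekly)
```

Note that a week containing a single trading day yields `NaN` for the standard deviation, so such rows may need handling before clustering.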

# Question 2 - K Means


On the new dataset created in the previous question, perform K means clustering with k = 3 *based on the following features only: Volatility, High Low Spread, and Average Trading volume*. Use a custom initialization by randomly selecting 3 data points from the dataset as initial centroids.

*Use these features for clustering in all upcoming questions.*

*Perform any necessary pre-processing*

**(4 points)**
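
As a rough sketch of the custom initialization described above, scikit-learn's `KMeans` accepts an array of starting centroids through its `init` parameter. The data below is synthetic, and the choice of `StandardScaler` for pre-processing is an assumption rather than a requirement of the assignment.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic stand-in for [Volatility, High_Low_Spread, Average_Trading_Volume].
X = np.column_stack([
    rng.uniform(0.005, 0.05, 60),
    rng.uniform(1, 20, 60),
    rng.uniform(1e6, 1e8, 60),
])

# The three features live on very different scales, so standardize them first.
X_scaled = StandardScaler().fit_transform(X)

# Pick 3 distinct rows at random to serve as the initial centroids.
init_idx = rng.choice(len(X_scaled), size=3, replace=False)
init_centroids = X_scaled[init_idx]

# n_init=1: with an explicit init array there is only one starting point to try.
kmeans = KMeans(n_clusters=3, init=init_centroids, n_init=1)
labels = kmeans.fit_predict(X_scaled)
print(np.bincount(labels, minlength=3), kmeans.inertia_)
```

Replacing `init_idx` with the row indices of hand-picked outliers gives the second initialization that the plots below compare against.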

Create two 3D figures using plotly:
- a plot with points colored based on the clustering obtained above.
- a plot with points colored based on the clustering obtained by initializing the centroids to some outliers. You are welcome to find these outliers visually or numerically.

Which clustering scheme was better? How did you make that evaluation? Explain thoroughly.

*For the upcoming questions, answer based on the best of the two clusterings you just plotted.*

**(3 Points)**

Looking at the resulting clustering:

1. Do you think the different clusters highlight specific segments or trends in the stock market?
2. Is the obtained clustering related to "Mean Return"? Explain your answer with the proper visualizations.

**(2 Points)**
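
One numerical starting point for the "Mean Return" question is a per-cluster summary, which can then be backed up with a box plot or similar visualization. The sketch below uses random data and a hypothetical `Cluster` column holding the labels; it only shows the shape of the computation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
# Synthetic stand-in: weekly mean returns plus a hypothetical cluster label per row.
weekly = pd.DataFrame({
    "Mean_Return": rng.normal(0, 0.01, 90),
    "Cluster": rng.integers(0, 3, 90),
})

# Compare the distribution of Mean_Return across clusters.
summary = weekly.groupby("Cluster")["Mean_Return"].agg(["count", "mean", "median", "std"])
print(summary)
```

If the per-cluster means and medians are close relative to the within-cluster spread, the clustering is probably not strongly related to Mean Return.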

*Your explanation goes here ... (i.e. edit this markdown
cell by double clicking here)*

In [None]:
# Python code if any

# Question 3 - Optimal K

### For both Optimal K questions, 2 <= k <= 10.

### Also present any visualizations if necessary.

Write Python code to determine the optimal k using the elbow method. **(2 Points)**
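
A minimal sketch of the elbow method, assuming scikit-learn's `KMeans` and synthetic blobs in place of the real features: fit one model per k and record the inertia (within-cluster sum of squared distances).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic 3-feature data standing in for the engineered dataset.
X, _ = make_blobs(n_samples=200, centers=4, n_features=3, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

ks = list(range(2, 11))  # 2 <= k <= 10
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias.append(model.inertia_)

for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 2))
```

Plotting `inertias` against `ks` and looking for the point where the curve flattens gives the elbow; the choice is a visual judgment rather than an exact rule.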

Write Python code to determine the optimal k using the silhouette method. **(2 Points)**

Reference: https://vitalflux.com/kmeans-silhouette-score-explained-with-python-example/
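
The silhouette method can be sketched in the same way, here with `sklearn.metrics.silhouette_score` on synthetic blobs. Picking the k with the highest average score is the usual convention, and it is the assumption made in this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic 3-feature data standing in for the engineered dataset.
X, _ = make_blobs(n_samples=200, centers=4, n_features=3, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

scores = {}
for k in range(2, 11):  # 2 <= k <= 10
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    # Average silhouette over all samples; values lie in [-1, 1], higher is better.
    scores[k] = silhouette_score(X_scaled, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("k with the highest silhouette score:", best_k)
```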

Considering both methods, what do you think the optimal K should be? Explain in detail. **(2 Points)**

*Your explanation goes here ... (i.e. edit this markdown
cell by double clicking here)*

# Question 4 - Hierarchical Clustering

Perform Hierarchical Clustering with K ranging from 2 to 11. Use the silhouette score to determine optimal K. Plot a graph with K on the X-axis and silhouette score on the Y-axis. **(4 Points)**
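
A sketch of the loop described above, assuming scikit-learn's `AgglomerativeClustering` with Ward linkage (the linkage choice is an assumption; the assignment does not prescribe one) and synthetic data:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic 3-feature data standing in for the engineered dataset.
X, _ = make_blobs(n_samples=200, centers=3, n_features=3, random_state=2)
X_scaled = StandardScaler().fit_transform(X)

ks = list(range(2, 12))  # K from 2 to 11 inclusive
sil_scores = []
for k in ks:
    labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X_scaled)
    sil_scores.append(silhouette_score(X_scaled, labels))

for k, score in zip(ks, sil_scores):
    print(k, round(score, 3))
```

The lists `ks` and `sil_scores` are then the X and Y values for the requested plot.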

After performing hierarchical clustering, you decide to apply your model to new incoming data (test data). Assuming the test data comes in a dataframe similar in shape to the original data you used to create the model, how would you use the same clustering arrangement with the new dataset? Explain any observations or challenges you find relevant. **(2 Points)**
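
One challenge worth illustrating: scikit-learn's `AgglomerativeClustering` has no `predict` method, so new points cannot be assigned directly. A possible workaround, shown here as an assumption and not as the expected answer, is to compute the centroid of each training cluster and assign each new point to the nearest one, reusing the scaler fitted on the training data.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestCentroid
from sklearn.preprocessing import StandardScaler

# Synthetic data: the first 200 rows play the original data, the last 20 the new data.
X, _ = make_blobs(n_samples=220, centers=3, n_features=3, random_state=1)
X_train, X_new = X[:200], X[200:]

# Fit the scaler on the training data only and reuse it for the new data.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)

train_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X_train_scaled)

# Approximate "prediction": assign each new point to the closest cluster centroid.
assigner = NearestCentroid().fit(X_train_scaled, train_labels)
new_labels = assigner.predict(scaler.transform(X_new))
print(new_labels)
```

This only approximates the hierarchical arrangement, since linkage-based clusters need not be compact around their centroids; re-running the clustering on the combined data is an alternative, but it can change the existing assignments.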

*Your explanation goes here ... (i.e. edit this markdown
cell by double clicking here)*