# Step 1: Data Collection
- Using libraries such as yfinance, investpy, and quandl, stock and sector data with monthly returns
will be collected from 2005-01-01 onward.
- The list of sectors and stocks will be obtained by web scraping.


## Data Collection

In [10]:
#!pip install yfinance

In [2]:
import yfinance as yf
import pandas as pd
import requests
from bs4 import BeautifulSoup
from io import StringIO

In [4]:
def fetch_sectors_names():
    """Scrape the sector overview table from stockanalysis.com."""
    url = "https://stockanalysis.com/stocks/industry/sectors/"
    response = requests.get(url)

    if response.status_code != 200:
        # Raise instead of printing, otherwise `df` would be returned unbound
        raise RuntimeError(f"Error: Failed to fetch data from page {url}")

    soup = BeautifulSoup(response.content, "html.parser")
    return pd.read_html(StringIO(str(soup.find_all("table"))))[0]

def fetch_industry_names():
    """Scrape the table of all industries from stockanalysis.com."""
    url = "https://stockanalysis.com/stocks/industry/all/"
    response = requests.get(url)

    if response.status_code != 200:
        raise RuntimeError(f"Error: Failed to fetch data from page {url}")

    soup = BeautifulSoup(response.content, "html.parser")
    return pd.read_html(StringIO(str(soup.find_all("table"))))[0]

def fetch_data(sectors):
    """Scrape the stock list of one sector, given its URL slug (e.g. 'energy')."""
    url = f"https://stockanalysis.com/stocks/sector/{sectors}/"
    response = requests.get(url)

    if response.status_code != 200:
        raise RuntimeError(f"Error: Failed to fetch data from page {url}")

    soup = BeautifulSoup(response.content, "html.parser")
    df = pd.read_html(StringIO(str(soup.find_all("table"))))[0]
    df.drop(columns='No.', inplace=True)  # drop the running row-number column
    return df

In [6]:
sectors=fetch_sectors_names()
industry=fetch_industry_names()

In [8]:
sectors

Unnamed: 0,Sector Name,Stocks,Market Cap,Div. Yield,PE Ratio,Profit Margin,1D Change,1Y Change
0,Financials,1271,11.93T,0.17%,15.94,19.88%,0.97%,35.98%
1,Healthcare,1158,"8,171.66B",0.49%,61.02,3.16%,1.50%,12.89%
2,Technology,769,21.60T,0.48%,46.07,14.58%,1.60%,49.03%
3,Industrials,659,"5,927.88B",1.19%,29.37,7.28%,0.62%,26.64%
4,Consumer Discretionary,561,"8,902.09B",0.74%,30.78,6.34%,-1.13%,40.35%
5,Materials,266,"2,039.33B",1.70%,28.9,6.04%,0.46%,12.51%
6,Real Estate,263,"1,676.43B",3.76%,49.81,9.41%,1.35%,12.83%
7,Energy,251,"3,668.24B",3.10%,13.4,8.20%,0.05%,13.84%
8,Communication Services,245,"6,417.55B",1.47%,33.5,11.48%,-2.34%,40.12%
9,Consumer Staples,243,"4,216.88B",1.52%,29.13,4.97%,0.83%,25.71%


In [12]:
# I keep the scraped data inside the data folder
#mkdir ..\data\stock_sectors

In [14]:
fetch_data(sectors='energy').to_csv('../data/stock_sectors/energy.csv')

fetch_data(sectors='financials').to_csv('../data/stock_sectors/financials.csv')

fetch_data(sectors='healthcare').to_csv('../data/stock_sectors/healthcare.csv')

fetch_data(sectors='technology').to_csv('../data/stock_sectors/technology.csv')

fetch_data(sectors='utilities').to_csv('../data/stock_sectors/utilities.csv')

fetch_data(sectors='real-estate').to_csv('../data/stock_sectors/real-estate.csv')

fetch_data(sectors='materials').to_csv('../data/stock_sectors/materials.csv')

fetch_data(sectors='industrials').to_csv('../data/stock_sectors/industrials.csv')

fetch_data(sectors='consumer-staples').to_csv('../data/stock_sectors/consumer-staples.csv')

fetch_data(sectors='consumer-discretionary').to_csv('../data/stock_sectors/consumer-discretionary.csv')

fetch_data(sectors='communication-services').to_csv('../data/stock_sectors/communication-services.csv')
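
The eleven calls above differ only in the sector slug, so they could be collapsed into a loop. A minimal sketch, assuming a hypothetical helper `save_sector_tables` (not part of the original notebook) that receives the fetch function as a parameter so it can be tried without network access:

```python
from pathlib import Path

import pandas as pd

# Sector slugs taken from the calls above
SECTOR_SLUGS = [
    "energy", "financials", "healthcare", "technology", "utilities",
    "real-estate", "materials", "industrials", "consumer-staples",
    "consumer-discretionary", "communication-services",
]


def save_sector_tables(slugs, fetch, out_dir):
    """Hypothetical helper: call fetch(slug) per slug and save the result as <slug>.csv."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # replaces the manual mkdir step
    written = []
    for slug in slugs:
        path = out_dir / f"{slug}.csv"
        fetch(slug).to_csv(path)
        written.append(path)
    return written
```

In the notebook this would be called as `save_sector_tables(SECTOR_SLUGS, fetch_data, "../data/stock_sectors")`.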

- I printed the column names to decide which column to use as the basis
- I will use the ['Open'] column

In [42]:
# Download the AAPL data
aapl_data = yf.download("AAPL")

# Print the column names
print(aapl_data.columns)

[*********************100%***********************]  1 of 1 completed

MultiIndex([( 'Close', 'AAPL'),
            (  'High', 'AAPL'),
            (   'Low', 'AAPL'),
            (  'Open', 'AAPL'),
            ('Volume', 'AAPL')],
           names=['Price', 'Ticker'])
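
The columns printed above form a two-level `MultiIndex` (`Price`, `Ticker`). The sketch below shows, on a toy frame with made-up prices and an extra illustrative ticker, what selecting `'Open'` on the first level returns:

```python
import pandas as pd

# Toy frame mimicking the (Price, Ticker) column layout; values are illustrative
columns = pd.MultiIndex.from_product(
    [["Close", "Open"], ["AAPL", "MSFT"]], names=["Price", "Ticker"]
)
prices = pd.DataFrame(
    [[10.0, 20.0, 9.5, 19.5], [11.0, 21.0, 10.5, 20.5]],
    index=pd.to_datetime(["2005-01-03", "2005-01-04"]),
    columns=columns,
)

# Selecting on the first level drops it and leaves one column per ticker
open_prices = prices["Open"]
print(open_prices)
```

The result is an ordinary frame whose columns are the ticker symbols, which is the shape the later resampling steps work on.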





In [166]:
financials = pd.read_csv('../data/stock_sectors/financials.csv')
healthcare = pd.read_csv('../data/stock_sectors/healthcare.csv')
technology = pd.read_csv('../data/stock_sectors/technology.csv')

```
tickers = technology['Symbol'].tolist()

# Download the data
data = yf.download(tickers, start='2005-01-01')
```

`TypeError: expected string or bytes-like object, got 'float'.`
- This error occurs because ``yfinance.download()`` encounters a ``float`` value, while the function only works with symbols of type ``string``. The ``technology['Symbol']`` column contains some ``NaN`` (that is, ``float``) values, which triggers the error. To fix it:
    1. I removed the NaN values
    2. I kept only the string values
- In addition
    - Because of the `YFRateLimitError('Too Many Requests. Rate limited. Try after a while.')` error, I downloaded the data in batches with pauses in between
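
The two cleaning steps above can be sketched on a toy CSV. The source does not say where the `float` came from. One plausible cause, stated here as an assumption, is a ticker spelled `NA`, which `pd.read_csv` parses as a missing value under its default settings. The symbols and company names below are illustrative:

```python
from io import StringIO

import pandas as pd

# Toy CSV; "NA" stands in for a ticker that pandas reads as missing by default
csv_text = "Symbol,Company\nAAPL,Apple\nNA,Example Corp\nMSFT,Microsoft\n"

default_parse = pd.read_csv(StringIO(csv_text))
# Steps 1 and 2 from the list above: drop NaN, then keep only strings
clean_tickers = [s for s in default_parse["Symbol"].dropna() if isinstance(s, str)]

# Alternative: disable the default NaN markers so "NA" survives as text
literal_parse = pd.read_csv(StringIO(csv_text), keep_default_na=False)
all_tickers = literal_parse["Symbol"].tolist()
```

The first approach silently loses the affected ticker, while the second keeps it. Which one is appropriate depends on whether the missing entries are real symbols.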

## **Historical Data Filtering and Random Company Selection**
- Before the random selection, I automatically filtered the stocks that have data from before 2005
- From each of the 3 largest sectors (Healthcare, Financials, and Technology) I randomly selected 500 companies
- No further processing was needed here, because I had already cleaned the empty columns and restricted all data to 2005 onward. In short, only the random selection of companies remained
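
The random selection described above changes on every run unless the generator is seeded. A minimal sketch of a reproducible draw, assuming a hypothetical helper `sample_tickers` and an arbitrary seed (neither is in the original notebook):

```python
import random


def sample_tickers(tickers, k, seed=None):
    """Hypothetical helper: draw up to k distinct tickers, reproducibly when seed is set."""
    rng = random.Random(seed)   # local generator, leaves the global random state untouched
    k = min(k, len(tickers))    # random.sample raises ValueError if k > len(tickers)
    return rng.sample(list(tickers), k)


# Same seed, same selection
print(sample_tickers(["AAPL", "ADTN", "AKAM", "AMKR"], 2, seed=42))
```

Capping `k` at the list length is a design choice: it avoids a crash for a small sector at the cost of returning fewer names than requested.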

In [170]:
import time
import random

# Valid symbols only (NaN / non-string values removed for every sector)
technology_tickers = technology['Symbol'].dropna().astype(str).tolist()
financials_tickers = financials['Symbol'].dropna().astype(str).tolist()
healthcare_tickers = healthcare['Symbol'].dropna().astype(str).tolist()

# Randomly selected 500 companies from each sector
technology_random = random.sample(technology_tickers, 500)
financials_random = random.sample(financials_tickers, 500)
healthcare_random = random.sample(healthcare_tickers, 500)

# Combined all companies
all_random_tickers = technology_random + financials_random + healthcare_random

[*********************100%***********************]  100 of 100 completed
[*********************100%***********************]  100 of 100 completed
[*********************100%***********************]  100 of 100 completed
[*********************100%***********************]  100 of 100 completed
[*********************100%***********************]  100 of 100 completed
[*********************100%***********************]  100 of 100 completed

2 Failed downloads:
['RIBB', 'TDAC']: YFTzMissingError('$%ticker%: possibly delisted; no timezone found')
[*********************100%***********************]  100 of 100 completed

1 Failed download:
['CRD.A']: YFTzMissingError('$%ticker%: possibly delisted; no timezone found')
[*********************100%***********************]  100 of 100 completed

3 Failed downloads:
['DMAA', 'SVCC']: YFTzMissingError('$%ticker%: possibly delisted; no timezone found')
['FGMC']: YFInvalidPeriodError("%ticker%: Period 'max' is invalid, must be of the format 1d, 5d, etc.")

             A_A_P_L    A_D_T_N  A_K_A_M  A_L_A_R  A_L_G_M   A_M_K_R  A_P_L_D  \
Date                                                                            
2005-01-03  0.975804  13.428749    13.00      NaN      NaN  6.474808      NaN   
2005-01-04  0.960890  13.134838    12.70      NaN      NaN  6.143978      NaN   
2005-01-05  0.970982  12.672986    12.19      NaN      NaN  5.595747      NaN   
2005-01-06  0.974146  12.575022    12.06      NaN      NaN  5.293273      NaN   
2005-01-07  0.979117  12.617007    12.30      NaN      NaN  5.208204      NaN   

            A_P_P_N  A_S_Y_S  B_K_K_T  ...  T_L_P_H  T_X_M_D  T_Y_R_A  \
Date                                   ...                              
2005-01-03      NaN     4.15      NaN  ...      NaN      NaN      NaN   
2005-01-04      NaN     4.19      NaN  ...      NaN      NaN      NaN   
2005-01-05      NaN     4.20      NaN  ...      NaN      NaN      NaN   
2005-01-06      NaN     4.10      NaN  ...      NaN      NaN      N

In [None]:
# The exception class lives in yfinance.exceptions, not on the download function
from yfinance.exceptions import YFRateLimitError

# Split the symbol list into chunks of 100 to make the download easier
chunks = [all_random_tickers[i:i+100] for i in range(0, len(all_random_tickers), 100)]

# Download the data chunk by chunk
all_data = []
for chunk in chunks:
    try:
        data = yf.download(chunk, start='2005-01-01')
        all_data.append(data)
        time.sleep(15)  # wait 15 seconds to stay under the API rate limit
    except YFRateLimitError:
        # Note: the failed chunk is skipped here, it is not downloaded again
        print("Rate limit exceeded. Waiting 60 seconds...")
        time.sleep(60)
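
The loop above pauses after a rate-limit error but then moves on to the next chunk. A hedged sketch of retrying the same chunk follows. `download_with_retry` is a hypothetical helper, and the download function is passed in so the logic can be tried without calling the API:

```python
import time


def download_with_retry(download, chunk, retries=3, wait=60,
                        retry_on=(Exception,), sleep=time.sleep):
    """Hypothetical helper: call download(chunk), retrying the same chunk after a pause."""
    for attempt in range(1, retries + 1):
        try:
            return download(chunk)
        except retry_on:
            if attempt == retries:
                raise                  # give up after the last attempt
            sleep(wait * attempt)      # wait longer after each failure
```

In the notebook it might be used as `download_with_retry(lambda c: yf.download(c, start='2005-01-01'), chunk, retry_on=(YFRateLimitError,))`, assuming `YFRateLimitError` is importable from `yfinance.exceptions` in the installed version.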

In [252]:
# Combine the results
final_data = pd.concat(all_data, axis=1)

# Take the "Open" prices
data_open = final_data['Open']

# Convert the daily data to monthly.
# Here: the "Open" value on the first trading day of each month.
# ('M' labels each month by its last calendar day; pandas >= 2.2 names this alias 'ME')
data_open_monthly = data_open.resample('M').first()

# Result
print(data_open_monthly.head())


Ticker          AAPL       ADTN   AKAM  ALAR  ALGM      AMKR  APLD  APPN  \
Date                                                                       
2005-01-31  0.975804  13.428749  13.00   NaN   NaN  6.474808   NaN   NaN   
2005-02-28  1.160630  12.526032  13.10   NaN   NaN  4.206262   NaN   NaN   
2005-03-31  1.355400  13.165858  11.01   NaN   NaN  4.206263   NaN   NaN   
2005-04-30  1.268032  12.462925  12.72   NaN   NaN  3.639126   NaN   NaN   
2005-05-31  1.090887  14.634976  11.96   NaN   NaN  3.166512   NaN   NaN   

Ticker      ASYS  BKKT  ...  TLPH  TXMD  TYRA  VCYT  VEEV  VERV   VRTX  \
Date                    ...                                              
2005-01-31  4.15   NaN  ...   NaN   NaN   NaN   NaN   NaN   NaN  10.70   
2005-02-28  3.69   NaN  ...   NaN   NaN   NaN   NaN   NaN   NaN  10.30   
2005-03-31  3.30   NaN  ...   NaN   NaN   NaN   NaN   NaN   NaN  11.54   
2005-04-30  3.25   NaN  ...   NaN   NaN   NaN   NaN   NaN   NaN   9.31   
2005-05-31  3.65   NaN 
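
To see what "first value of each month" means, the sketch below applies the same idea to synthetic data. It groups by calendar month with `to_period("M")` instead of `resample`, an equivalent formulation chosen here so that the frequency alias does not depend on the pandas version. The values are just positions 0..59, not prices:

```python
import pandas as pd

# 60 consecutive business days starting on 2005-01-03 (holidays are ignored)
days = pd.date_range("2005-01-03", periods=60, freq="B")
daily_open = pd.Series(range(60), index=days)

# Group by calendar month and keep the first observation of each month
monthly_first = daily_open.groupby(daily_open.index.to_period("M")).first()
print(monthly_first)
```

The index of the result is a monthly period (`2005-01`, `2005-02`, ...) rather than the month-end timestamp that `resample` produces, but the selected values are the same.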

In [254]:
# Convert to long format
long_data = data_open_monthly.reset_index().melt(id_vars='Date', var_name='Ticker', value_name='Open')

# Add the sector information
sector_map = {}
for ticker in financials_random:
    sector_map[ticker] = 'Financials'

for ticker in healthcare_random:
    sector_map[ticker] = 'Healthcare'

for ticker in technology_random:
    sector_map[ticker] = 'Technology'

long_data['Sector'] = long_data['Ticker'].map(sector_map)

# Remove rows with missing values
long_data = long_data.dropna(subset=['Open'])
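
The wide-to-long conversion above can be followed on a two-ticker toy frame. The prices are illustrative. Only the column names and the sector labels mirror the cell above:

```python
import pandas as pd

# Wide format: one column per ticker, one row per month
wide = pd.DataFrame(
    {"AAPL": [1.0, 1.2], "VRTX": [10.7, None]},
    index=pd.to_datetime(["2005-01-31", "2005-02-28"]),
)
wide.index.name = "Date"

# Long format: one row per (Date, Ticker) pair
long_df = wide.reset_index().melt(id_vars="Date", var_name="Ticker", value_name="Open")
long_df["Sector"] = long_df["Ticker"].map({"AAPL": "Technology", "VRTX": "Healthcare"})
long_df = long_df.dropna(subset=["Open"])
print(long_df)
```

Four cells in the wide frame become three rows, because the missing VRTX value is dropped.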

In [274]:
# Check the first 5 rows
long_data.head()

# Save the data
long_data.to_csv('../data/stock_sectors/combined_data.csv', index=False)

In [270]:
!pip install --upgrade tsfresh



In [272]:
import tsfresh
# tsfresh expects numeric value columns, so the string 'Sector' column is left out
features_input = long_data[['Ticker', 'Date', 'Open']]

data_extract_features = tsfresh.extract_features(features_input, column_id='Ticker', column_sort='Date', 
                                       default_fc_parameters=tsfresh.feature_extraction.MinimalFCParameters())

ImportError: cannot import name 'cwt' from 'scipy.signal' (D:\Anaconda\Lib\site-packages\scipy\signal\__init__.py)

In [192]:
# Save the data to a CSV file (no need to run this again and again)
data_open.to_csv('../data/stock_sectors/combined_data.csv')

In [48]:
# Load the CSV file
data = pd.read_csv('../data/stock_sectors/combined_data.csv', index_col=0, parse_dates=True)

# Cleaned up the company names (e.g. 'A_A_P_L' -> 'AAPL')
data.columns = [''.join(col.split('_')) for col in data.columns]

# Resampled to month end and computed month-over-month gross returns
data_monthly_mom = data.resample('M').last().pct_change() + 1

# Keep only the symbols that are not entirely missing
data_monthly_mom = data_monthly_mom.dropna(axis=1, how='all')

# Result
data_monthly_mom

Date,AAPL,AEYE,AGYS,AMBA,ARBE,ASTC,ATCH,AZ,BELFA,BKTI,...,TNYA,TOI,TSHA,TWST,TXG,UFPT,VCEL,WBA,XTNT,ZIMV
2005-01-31,,,,,,,,,,,...,,,,,,,,,,
2005-02-28,1.198176,,1.140896,,,1.000000,,,0.901253,0.823045,...,,,,,,1.005714,0.753623,0.992865,,
2005-03-31,0.950089,,1.033490,,,1.141975,,,1.020246,1.170000,...,,,,,,1.548295,0.842308,1.054200,,
2005-04-30,0.851590,,0.688177,,,0.918919,,,0.905337,1.051282,...,,,,,,0.755963,0.936073,0.955041,,
2005-05-31,1.124759,,1.148858,,,0.829412,,,1.088679,0.934959,...,,,,,,0.779126,1.297561,1.061595,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2024-10-31,0.996957,0.917012,0.909936,1.057548,0.959184,0.914498,0.616601,2.460840,1.035413,1.275686,...,1.040404,1.048232,0.893720,0.922225,0.730804,0.846161,1.061786,1.012048,0.818182,0.876756
2024-11-30,1.024978,1.230769,1.368437,1.243940,0.952128,1.006775,1.365385,1.225092,0.955665,1.178936,...,1.567961,0.527607,1.627027,1.145819,0.948687,1.180235,1.272727,1.012535,0.759259,1.064822
2024-12-31,1.075082,0.584559,1.006345,0.983919,1.212291,0.932705,0.788732,0.986446,0.945014,1.054405,...,0.452012,1.866279,0.564784,0.973067,0.929169,0.750938,0.954969,1.024229,1.048781,0.956224
2025-01-31,0.979203,1.218239,0.685581,1.084488,1.165899,0.974026,0.446429,1.129771,0.990941,0.996482,...,0.732877,3.084112,0.876471,1.084513,1.051975,1.140239,1.067570,1.069892,1.279070,1.014306
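
The values in the table above are gross returns: `pct_change()` gives P_t / P_{t-1} - 1, and adding 1 turns that into the ratio P_t / P_{t-1}. A small worked example with illustrative month-end prices (not taken from the data set):

```python
import pandas as pd

month_end = pd.Series(
    [100.0, 110.0, 99.0],
    index=pd.to_datetime(["2005-01-31", "2005-02-28", "2005-03-31"]),
)

# The first value is NaN because there is no previous month to compare with
gross = month_end.pct_change() + 1

# Multiplying the gross returns recovers the total growth over the window
total_growth = gross.dropna().prod()
print(gross)
print(total_growth)
```

This is why the 2005-01-31 row in the output is empty for every ticker, and why a value such as 1.198176 reads as a gain of roughly 19.8% over the month.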


## **Feature Extraction and Selection**
**Feature Extraction:** 

- Automatic feature extraction will be performed with the ``tsfresh`` library.
- In this process, statistical features (mean, standard deviation, autocorrelation, etc.) will be extracted from the time series data.
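
The kind of per-ticker statistics named above can be illustrated with plain pandas. This is a stand-in sketch, not tsfresh and not its feature set. The tickers `AAA` and `BBB` and their values are made up:

```python
import pandas as pd

dates = ["2005-01-31", "2005-02-28", "2005-03-31", "2005-04-30"]
returns = pd.DataFrame({
    "Ticker": ["AAA"] * 4 + ["BBB"] * 4,
    "Date": pd.to_datetime(dates * 2),
    "Return": [1.0, 2.0, 3.0, 4.0, 2.0, 2.0, 4.0, 4.0],
})

# One row of features per ticker: mean, sample standard deviation, lag-1 autocorrelation
features = (
    returns.sort_values(["Ticker", "Date"])
    .groupby("Ticker")["Return"]
    .agg(
        mean="mean",
        std="std",
        autocorr_lag1=lambda s: s.autocorr(lag=1),
    )
)
print(features)
```

Each time series collapses into a single feature row, which is also the shape `tsfresh.extract_features` returns: one row per id, one column per feature.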

In [110]:
#!pip install tsfresh

Collecting tsfresh

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gensim 4.3.0 requires FuzzyTM>=0.4.0, which is not installed.



  Obtaining dependency information for tsfresh from https://files.pythonhosted.org/packages/11/04/5980fc134d618f77516b42d9babcb103e6d539a6386c05e649ac1dab6422/tsfresh-0.20.3-py2.py3-none-any.whl.metadata
  Downloading tsfresh-0.20.3-py2.py3-none-any.whl.metadata (2.6 kB)
Collecting stumpy>=1.7.2 (from tsfresh)
  Obtaining dependency information for stumpy>=1.7.2 from https://files.pythonhosted.org/packages/43/0f/305bc39f513eb7cb6406f1cd445f58f2b260526693afbe900dc6e9802410/stumpy-1.13.0-py3-none-any.whl.metadata
  Downloading stumpy-1.13.0-py3-none-any.whl.metadata (28 kB)
Collecting scipy>=1.14.0 (from tsfresh)
  Obtaining dependency information for scipy>=1.14.0 from https://files.pythonhosted.org/packages/af/25/caa430865749d504271757cafd24066d596217e83326155993980bc22f97/scipy-1.15.1-cp311-cp311-win_amd64.whl.metadata
  Downloading scipy-1.15.1-cp311-cp311-win_amd64.whl.metadata (60 kB)

In [116]:
# Prepare the data (using data_monthly_mom)
# tsfresh expects long format: one row per (Ticker, Date) pair, with the value in its own column.
# Assumes the index is named 'Date', as in the output above.
long_returns = (data_monthly_mom.reset_index()
                .melt(id_vars='Date', var_name='Ticker', value_name='Return')
                .dropna(subset=['Return']))

data_extract_features = tsfresh.extract_features(long_returns, column_id='Ticker', column_sort='Date', 
                                       default_fc_parameters=tsfresh.feature_extraction.MinimalFCParameters())


ImportError: cannot import name 'cwt' from 'scipy.signal' (D:\Anaconda\Lib\site-packages\scipy\signal\__init__.py)
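
The traceback says that `scipy.signal` in this environment does not provide `cwt`, which the installed tsfresh tries to import. A likely explanation, stated here as an assumption, is a version mismatch between the two packages: the pip log above shows tsfresh 0.20.3 being installed alongside scipy 1.15.1. A small diagnostic sketch, using a hypothetical helper `module_has`:

```python
import importlib


def module_has(module_name, attribute):
    """Hypothetical helper: True if the module imports and exposes the attribute."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, attribute)


# A False result here would match the ImportError shown above
print(module_has("scipy.signal", "cwt"))
```

If the check confirms the mismatch, aligning the versions (a newer tsfresh or an older SciPy) is the usual remedy, though which combination works is not confirmed by the source.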