
The goal of this project is to run a similarity study by building a classification model on factors derived from time series of stocks in different sectors.

For example, take a stock X that we are interested in, and suppose it behaves differently from the other stocks in its own sector: the sector as a whole is rising while this stock shows no movement. If we can answer the question of which sector the stock resembles most, we can form a hypothesis based on that sector's movements.

This project covers the following steps:

- Obtaining the list of sectors via web scraping and downloading the data (`yfinance`, `investpy`, [`quandl`](https://docs.data.nasdaq.com/v1.0/docs/python-installation))
- Building series of monthly returns starting from 2005-01-01
- Computing return factors (such as momentum) for the 3 largest sectors
- Feature engineering on these momentum series with tsfresh (imputing, encoding, transformation, and more)
- Building a model on the new features and the sector classes (including selecting the best model)
- Taking samples from other sectors, applying the same feature engineering steps, and then deciding which sector each sample resembles
- **Bonus:** For example, after predicting all symbols in the Real Estate sector, finding which sector (T, F, H) the majority of them resemble
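
As a rough illustration of the factor step in the list above, momentum can be computed as the compounded return over a trailing window. This is only a minimal sketch on made-up numbers: the helper name `trailing_momentum` and the window length are assumptions for illustration and are not defined anywhere in this notebook.

```python
import numpy as np
import pandas as pd


def trailing_momentum(gross_returns: pd.Series, window: int) -> pd.Series:
    """Compounded return over the last `window` periods (hypothetical helper)."""
    # Summing log returns and exponentiating equals multiplying gross returns.
    log_returns = np.log(gross_returns)
    return np.exp(log_returns.rolling(window).sum()) - 1


# Made-up gross returns (1.10 means +10% in that period).
toy_returns = pd.Series([1.10, 1.00, 0.90, 1.20])
toy_momentum = trailing_momentum(toy_returns, window=2)
print(toy_momentum)
```

The first `window - 1` entries are NaN because a full window is not yet available.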

In [None]:
import io

import pandas as pd
import requests
import yfinance
from bs4 import BeautifulSoup


def _first_table(url):
    """Download a page and return its first HTML table as a DataFrame."""
    response = requests.get(url)
    if response.status_code != 200:
        # Fail loudly instead of falling through to an undefined `df`.
        raise RuntimeError(f"Failed to fetch data from page {url}")
    soup = BeautifulSoup(response.content, "html.parser")
    # Wrap the HTML in StringIO: newer pandas versions deprecate literal HTML strings.
    return pd.read_html(io.StringIO(str(soup.find_all("table"))))[0]


def fetch_sectors_names():
    """Return the table of sectors listed on stockanalysis.com."""
    return _first_table("https://stockanalysis.com/stocks/industry/sectors/")


def fetch_data(sectors):
    """Return the table of symbols for one sector slug, e.g. 'energy'."""
    df = _first_table(f"https://stockanalysis.com/stocks/sector/{sectors}/")
    return df.drop(columns='No.')

In [None]:
sectors=fetch_sectors_names()
sectors

Unnamed: 0,Sector Name,Stocks,Market Cap,Div. Yield,PE Ratio,Profit Margin,1D Change,1Y Change
0,Financials,1393,"9,708.44B",2.52%,14.51,17.56%,-1.00%,32.23%
1,Healthcare,1195,"7,986.48B",0.48%,61.13,3.27%,-0.95%,26.65%
2,Technology,786,18.70T,0.44%,46.12,13.64%,-0.42%,54.45%
3,Industrials,649,"5,510.16B",1.12%,27.35,7.20%,-0.93%,35.13%
4,Consumer Discretionary,574,"7,197.32B",0.72%,25.36,6.31%,-0.45%,31.93%
5,Materials,266,"2,136.40B",1.51%,27.51,6.63%,-1.09%,24.40%
6,Real Estate,255,"1,410.11B",4.23%,42.0,9.88%,-1.17%,6.48%
7,Communication Services,254,"5,711.12B",1.16%,29.84,10.05%,-0.21%,42.33%
8,Energy,253,"3,702.56B",3.24%,11.01,9.65%,-1.05%,23.10%
9,Consumer Staples,248,"3,858.67B",1.43%,27.32,4.88%,-0.79%,13.87%


## Accessing the sector lists

The functions above tell us which symbols belong to which sector, and the calls below store that information in `.csv` files. The data for each sector's symbols can then be downloaded and labelled by sector.
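
The storing step described above can also be written as a loop over sector slugs. The sketch below is an assumption-laden illustration: `save_sector_tables` and `fake_fetch` are hypothetical names, and the fetcher is passed in as an argument so the example runs offline with made-up rows instead of scraping the site.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_sector_tables(slugs, fetch, out_dir):
    """Save one CSV per sector slug; `fetch` maps a slug to a DataFrame."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = {}
    for slug in slugs:
        path = out_dir / f"{slug}.csv"
        # index=False avoids an extra "Unnamed: 0" column when reading back.
        fetch(slug).to_csv(path, index=False)
        paths[slug] = path
    return paths


def fake_fetch(slug):
    # Stand-in for a real scraper so the sketch runs offline (made-up rows).
    return pd.DataFrame({"Symbol": ["AAA", "BBB"], "Sector": [slug, slug]})


with tempfile.TemporaryDirectory() as tmp:
    saved = save_sector_tables(["energy", "technology"], fake_fetch, tmp)
    reloaded = pd.read_csv(saved["energy"])
    print(reloaded)
```

In this notebook, `fetch_data` could be passed as the `fetch` argument in place of `fake_fetch`.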

In [None]:
fetch_data(sectors='energy').to_csv('../content/energy.csv')
fetch_data(sectors='financials').to_csv('../content/financials.csv')
fetch_data(sectors='healthcare').to_csv('../content/healthcare.csv')
fetch_data(sectors='technology').to_csv('../content/technology.csv')
fetch_data(sectors='utilities').to_csv('../content/utilities.csv')
fetch_data(sectors='real-estate').to_csv('../content/real-estate.csv')
fetch_data(sectors='materials').to_csv('../content/materials.csv')
fetch_data(sectors='industrials').to_csv('../content/industrials.csv')
fetch_data(sectors='consumer-staples').to_csv('../content/consumer-staples.csv')
fetch_data(sectors='consumer-discretionary').to_csv('../content/consumer-discretionary.csv')
fetch_data(sectors='communication-services').to_csv('../content/communication-services.csv')

In [None]:
finance = pd.read_csv('../content/financials.csv')
finance.Symbol

0      BRK.B
1        JPM
2          V
3         MA
4        BAC
       ...  
972      DXF
973     NCPL
974     RELI
975     LGHL
976    AIMAU
Name: Symbol, Length: 977, dtype: object

## Accessing the data
Suppose we want to download data for the `HSBC` symbol from the financial sector. `yfinance` can be used for this step. First we create an object with `.Ticker` and use it to confirm that we have the right stock. Next, `.get_history_metadata()` gives access to the symbol's metadata. Finally, `.history(period='2y')` downloads 2 years of data into the working environment.

In [None]:
import yfinance as yf
ticker_name = yf.Ticker("HSBC")
ticker_name.info

{'address1': '8 Canada Square',
 'city': 'London',
 'zip': 'E14 5HQ',
 'country': 'United Kingdom',
 'phone': '44 20 7991 8888',
 'fax': '44 20 7992 4880',
 'website': 'https://www.hsbc.com',
 'industry': 'Banks - Diversified',
 'industryKey': 'banks-diversified',
 'industryDisp': 'Banks - Diversified',
 'sector': 'Financial Services',
 'sectorKey': 'financial-services',
 'sectorDisp': 'Financial Services',
 'longBusinessSummary': 'HSBC Holdings plc provides banking and financial services worldwide. The company operates through Wealth and Personal Banking, Commercial Banking, and Global Banking and Markets segments. The Wealth and Personal Banking segment offers retail banking and wealth products, including current and savings accounts, mortgages and personal loans, credit and debit cards, and local and international payment services; and wealth management services comprising insurance and investment products, global asset management services, investment management, and private wealth 

In [None]:
ticker_name.get_history_metadata()

ERROR:yfinance:HSBC: Period '1wk' is invalid, must be one of ['1d', '5d', '1mo', '3mo', '6mo', '1y', '2y', '5y', '10y', 'ytd', 'max']


{'currency': 'USD',
 'symbol': 'HSBC',
 'exchangeName': 'NYQ',
 'fullExchangeName': 'NYSE',
 'instrumentType': 'EQUITY',
 'firstTradeDate': 932131800,
 'regularMarketTime': 1717012802,
 'hasPrePostMarketData': True,
 'gmtoffset': -14400,
 'timezone': 'EDT',
 'exchangeTimezoneName': 'America/New_York',
 'regularMarketPrice': 43.74,
 'fiftyTwoWeekHigh': 43.93,
 'fiftyTwoWeekLow': 43.67,
 'regularMarketDayHigh': 43.93,
 'regularMarketDayLow': 43.67,
 'regularMarketVolume': 1172017,
 'chartPreviousClose': 44.21,
 'previousClose': 44.16,
 'scale': 3,
 'priceHint': 2,
 'currentTradingPeriod': {'pre': {'timezone': 'EDT',
   'start': 1717056000,
   'end': 1717075800,
   'gmtoffset': -14400},
  'regular': {'timezone': 'EDT',
   'start': 1717075800,
   'end': 1717099200,
   'gmtoffset': -14400},
  'post': {'timezone': 'EDT',
   'start': 1717099200,
   'end': 1717113600,
   'gmtoffset': -14400}},
 'tradingPeriods':                                           pre_start                   pre_end  \
 

In [None]:
data=ticker_name.history(period='2y')
data.tail()

Date,Open,High,Low,Close,Volume,Dividends,Stock Splits
2024-05-22 00:00:00-04:00,44.509998,44.599998,44.099998,44.209999,1181200,0.0,0.0
2024-05-23 00:00:00-04:00,44.299999,44.369999,43.790001,43.799999,1392400,0.0,0.0
2024-05-24 00:00:00-04:00,44.360001,44.650002,44.310001,44.380001,1198800,0.0,0.0
2024-05-28 00:00:00-04:00,44.029999,44.400002,43.950001,44.16,1506300,0.0,0.0
2024-05-29 00:00:00-04:00,43.869999,43.93,43.669998,43.740002,1185400,0.0,0.0


Now we can download the data for the selected symbols from a given start date onward and then compute periodic returns. This is the approach used throughout the project. Note that the code below resamples to weekly bars ending on Monday (`'W-MON'`), not monthly.
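
To make the return calculation concrete, here is a minimal sketch on made-up prices. The helper name `gross_returns` is hypothetical; the resample rule `'W-MON'` mirrors the one used in this notebook, and a different rule could be passed for another frequency.

```python
import pandas as pd


def gross_returns(prices: pd.Series, rule: str = "W-MON") -> pd.Series:
    """Resample to `rule`, keep the last value per bin, return P_t / P_{t-1}."""
    return prices.resample(rule).last().pct_change() + 1


# Made-up daily prices on 15 business days starting Monday 2024-01-01.
days = pd.date_range("2024-01-01", periods=15, freq="B")
toy_prices = pd.Series(range(100, 115), index=days, dtype="float64")

weekly = gross_returns(toy_prices)  # weekly bins ending on Mondays
print(weekly)
```

The first entry is NaN because there is no earlier bin to compare against, which is why the first row is removed later in the notebook.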

In [None]:
def fetch_sector_data(sector_name):
    """Download prices for every symbol in a sector and return weekly gross returns."""
    symbols = pd.read_csv(f'../content/{sector_name}.csv')
    symbol_list = symbols["Symbol"].tolist()
    # Skip missing entries (NaN) in the Symbol column
    valid_symbols = [sym for sym in symbol_list if isinstance(sym, str)]
    data = yf.download(valid_symbols, start='2005-01-01')
    # Despite the variable name, this uses the 'Open' column: last open of each
    # week ending on Monday, converted to gross returns (1 + percentage change)
    data_close = data['Open'].resample('W-MON').last().pct_change() + 1
    return data_close

# Fetch healthcare sector data
data_close_healthcare = fetch_sector_data('healthcare')

# Fetch financials sector data
data_close_financials = fetch_sector_data('financials')

# Fetch technology sector data
data_close_technology = fetch_sector_data('technology')

[**********************73%%*********             ]  878 of 1196 completed

$BIO.B: possibly delisted; No price data found  (1d 2005-01-01 -> 2024-05-29)


[*********************100%%**********************]  1196 of 1196 completed
ERROR:yfinance:
1 Failed download:
ERROR:yfinance:['BIO.B']: YFPricesMissingError('$%ticker%: possibly delisted; No price data found  (1d 2005-01-01 -> 2024-05-29)')
[*********************100%%**********************]  977 of 977 completed
ERROR:yfinance:
9 Failed downloads:
ERROR:yfinance:['OTEC', 'BRK.B', 'LCAA', 'AGM.A', 'CRD.B', 'GPAT', 'BNRE.A', 'TGVC', 'CRD.A']: YFTzMissingError('$%ticker%: possibly delisted; No timezone found')
[*********************100%%**********************]  785 of 785 completed


In [None]:
# Drop the symbols that failed to download, based on the errors above
drop_symbol= ['LCAA', 'OTEC', 'BNRE.A', 'BRK.B', 'CRD.B', 'GPAT', 'TGVC', 'CRD.A', 'AGM.A', 'BIO.B']
data_close_financials = data_close_financials.drop(columns=drop_symbol, errors='ignore')
data_close_healthcare = data_close_healthcare.drop(columns=drop_symbol, errors='ignore')
data_close_technology

Date,AAOI,AAPL,ACIW,ACLS,ACMR,ACN,ADBE,ADEA,ADI,ADSK,...,ZBRA,ZENV,ZEO,ZEPP,ZETA,ZI,ZM,ZPTA,ZS,ZUO
2005-01-03,,,,,,,,,,,...,,,,,,,,,,
2005-01-10,,1.077956,0.892982,0.829448,,0.988868,0.933333,0.912017,0.949757,0.903394,...,0.934507,,,,,,,,,
2005-01-17,,1.006015,0.994983,1.036982,,0.981989,0.994218,1.147647,1.001138,0.923699,...,0.970045,,,,,,,,,
2005-01-24,,1.010392,1.000000,1.008559,,0.964845,0.980842,0.952332,0.990909,0.939612,...,0.993203,,,,,,,,,
2005-01-31,,1.050719,1.175910,1.039604,,1.017822,0.980119,1.025296,1.006307,0.973360,...,0.988854,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2024-05-06,1.023593,1.051797,1.060797,1.091597,1.014755,0.993845,1.025228,0.984283,0.996813,0.986904,...,1.045904,0.911111,1.084704,1.012941,0.995356,0.988450,0.989120,0.810127,1.000900,1.026210
2024-05-13,0.913121,1.016945,1.020045,1.018222,0.850600,1.009765,0.992124,1.135729,1.043263,1.019812,...,1.022265,1.146341,0.986501,0.988386,1.237170,0.787823,1.007603,0.914062,0.990225,1.012770
2024-05-20,1.135922,1.020977,1.034320,0.988477,1.019658,0.984942,0.993337,1.017575,1.029354,1.007325,...,0.991037,1.038298,1.033684,1.088132,1.096795,1.014832,1.029379,0.779487,1.014751,1.011639
2024-05-27,1.025641,0.997306,0.956382,0.979599,0.945096,1.011136,0.996377,1.018998,1.099414,0.958898,...,1.028681,1.528688,0.970061,0.832613,0.983381,0.990769,0.969588,0.903509,0.956055,0.978907


In [None]:
# Remove the row dated '2005-01-03', because it is NaN for every symbol.
data_close_technology = data_close_technology.drop(pd.Timestamp('2005-01-03'))
data_close_healthcare = data_close_healthcare.drop(pd.Timestamp('2005-01-03'))
data_close_financials = data_close_financials.drop(pd.Timestamp('2005-01-03'))

In [None]:
def data_cleaning(data, threshold=0.25):
  na_counts = data.isna().sum()  # number of NaNs per symbol
  total_count = len(data)  # total number of rows
  # symbols whose NaN count exceeds the threshold (25% by default)
  cols_to_drop = na_counts[na_counts > threshold * total_count].index
  DroppedData = data.drop(columns=cols_to_drop)
  # fill remaining NaNs forward, then backward (fillna(method=...) is deprecated)
  FilledData = DroppedData.ffill().bfill()

  return FilledData
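
The effect of the 25% threshold is easier to see on a tiny frame. The following is a sketch with made-up values; `drop_sparse_then_fill` is a hypothetical stand-in that follows the same rule as `data_cleaning` (drop a column only when its NaN share is strictly above the threshold).

```python
import numpy as np
import pandas as pd


def drop_sparse_then_fill(data: pd.DataFrame, threshold: float = 0.25) -> pd.DataFrame:
    """Drop columns whose NaN share exceeds `threshold`, then fill remaining gaps."""
    nan_share = data.isna().mean()
    kept = data.loc[:, nan_share <= threshold]
    return kept.ffill().bfill()


toy = pd.DataFrame(
    {
        "A": [1.0, 1.1, 0.9, 1.0],           # complete
        "B": [np.nan, 1.2, 1.0, 0.8],        # 25% missing: kept, back-filled
        "C": [np.nan, np.nan, np.nan, 1.0],  # 75% missing: dropped
    }
)
cleaned = drop_sparse_then_fill(toy)
print(cleaned)
```

Back-filling copies a later value into an earlier date, so a symbol that started trading after 2005 carries a constant value until its first real observation.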

In [None]:
TechnologyData=data_cleaning(data_close_technology)
HealthCareData=data_cleaning(data_close_healthcare)
FinancialsData=data_cleaning(data_close_financials)
NaN_count = [TechnologyData.isna().sum().sum(),HealthCareData.isna().sum().sum(),FinancialsData.isna().sum().sum()]
print(NaN_count)
Column_Count=[TechnologyData.shape[1],HealthCareData.shape[1],FinancialsData.shape[1]]
print(Column_Count)

[0, 0, 0]
[332, 334, 468]


In [None]:
# !pip install tsfresh
from tsfresh import extract_features
from tsfresh.feature_extraction import EfficientFCParameters

def data_generating(data, label):
    data = data.reset_index()
    print(data)
    data_melted = data.melt(id_vars=["Date"], var_name="id", value_name="value")
    print(data_melted)
    data_melted['Date'] = pd.to_datetime(data_melted['Date'])
    print(data_melted)
    # Extract features with the EfficientFCParameters preset (skips the computationally expensive calculators).
    extracted_features = extract_features(data_melted, column_id='id', column_sort='Date', default_fc_parameters=EfficientFCParameters())
    extracted_features = extracted_features.reset_index().rename(columns={'index': 'id'})
    extracted_features['class'] = label
    return extracted_features

# Combine all sectors into one feature table.
features_finance = data_generating(FinancialsData, 'F')
features_healthcare = data_generating(HealthCareData, 'H')
features_technology = data_generating(TechnologyData, 'T')

ts_features = pd.concat([features_finance, features_healthcare, features_technology], ignore_index=True)




Ticker       Date      AAME        AB      ABCB      ACGL      ACIC      ACNB  \
0      2005-01-10  1.058632  0.953846  0.902043  0.986493  1.002759  1.000000   
1      2005-01-17  0.972308  1.029777  0.967576  0.985519  1.002759  1.000387   
2      2005-01-24  0.993671  0.992289  1.026596  1.000267  1.002759  0.984145   
3      2005-01-31  0.958599  1.072122  1.041451  0.987981  1.002759  1.017682   
4      2005-02-07  1.009967  0.994337  0.995025  1.054339  1.002759  0.984556   
...           ...       ...       ...       ...       ...       ...       ...   
1008   2024-05-06  0.939698  0.998201  1.030509  1.043612  0.955473  1.007564   
1009   2024-05-13  0.941176  0.978672  1.002200  1.056458  1.174757  0.989490   
1010   2024-05-20  0.982955  1.038981  1.004391  1.009190  1.086777  1.022155   
1011   2024-05-27  0.971098  0.992319  0.957273  1.013659  0.988593  0.961995   
1012   2024-06-03  1.023810  0.982435  0.966369  0.999414  0.919231  1.001389   

Ticker       AEG       AFG 

Feature Extraction: 100%|██████████| 468/468 [02:20<00:00,  3.34it/s]


Ticker       Date         A      ABEO      ABIO       ABT      ABUS      ABVC  \
0      2005-01-10  0.940664  0.916667  0.987988  1.021290  1.135135  1.000000   
1      2005-01-17  0.959418  0.939394  0.982776  0.963361  1.135135  1.000000   
2      2005-01-24  1.003218  1.009677  0.932990  1.005246  1.135135  1.000000   
3      2005-01-31  1.005500  0.888179  0.892818  0.984344  1.135135  1.000000   
4      2005-02-07  1.039198  1.104317  1.021040  1.005081  1.135135  1.875000   
...           ...       ...       ...       ...       ...       ...       ...   
1008   2024-05-06  1.009258  1.361032  0.918768  0.988081  0.985455  0.784173   
1009   2024-05-13  1.071592  0.892632  1.051829  0.989445  1.070111  0.972477   
1010   2024-05-20  1.029693  1.101415  0.942029  0.988475  1.027586  1.018868   
1011   2024-05-27  0.979476  0.892934  1.003077  1.003565  1.104027  0.925926   
1012   2024-06-03  0.973012  0.971223  1.107362  0.970619  1.003040  0.924100   

Ticker      ACAD      ACHC 

Feature Extraction: 100%|██████████| 334/334 [01:39<00:00,  3.36it/s]


Ticker       Date      AAPL      ACIW      ACLS       ACN      ADBE      ADEA  \
0      2005-01-10  1.077956  0.892982  0.829448  0.988868  0.933333  0.912017   
1      2005-01-17  1.006015  0.994983  1.036982  0.981989  0.994218  1.147647   
2      2005-01-24  1.010392  1.000000  1.008559  0.964845  0.980842  0.952332   
3      2005-01-31  1.050719  1.175910  1.039604  1.017822  0.980119  1.025296   
4      2005-02-07  1.058326  1.011434  1.134694  0.994942  1.137900  0.993963   
...           ...       ...       ...       ...       ...       ...       ...   
1008   2024-05-06  1.051797  1.060797  1.091597  0.993845  1.025228  0.984283   
1009   2024-05-13  1.016945  1.020045  1.018222  1.009765  0.992124  1.135729   
1010   2024-05-20  1.020977  1.034320  0.988477  0.984942  0.993337  1.017575   
1011   2024-05-27  0.997306  0.956382  0.979599  1.011136  0.996377  1.018998   
1012   2024-06-03  1.003654  0.959149  1.018842  0.958487  0.982400  0.967797   

Ticker       ADI      ADSK 

Feature Extraction: 100%|██████████| 332/332 [01:35<00:00,  3.47it/s]
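
The wide-to-long reshaping that `data_generating` performs before calling tsfresh can be seen on a toy frame. This sketch uses made-up tickers and values and only pandas; it stops before the tsfresh call itself.

```python
import pandas as pd

# Made-up wide frame: one row per date, one column per ticker.
wide = pd.DataFrame(
    {"AAA": [1.01, 0.99, 1.02], "BBB": [0.98, 1.00, 1.03]},
    index=pd.to_datetime(["2024-01-01", "2024-01-08", "2024-01-15"]),
)
wide.index.name = "Date"

# Long format: one row per (date, ticker) pair, matching the
# column_id="id" / column_sort="Date" arguments used above.
long = wide.reset_index().melt(id_vars=["Date"], var_name="id", value_name="value")
print(long)
```

Each ticker becomes one time series identified by `id`, so tsfresh returns one feature row per ticker.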


In [None]:
ts_features = ts_features.dropna(axis=1)
nunique = ts_features.nunique()
cols_to_drop = nunique[nunique == 1].index
ts_features = ts_features.drop(columns=cols_to_drop)
ts_features

Unnamed: 0,id,value__variance_larger_than_standard_deviation,value__has_duplicate_max,value__has_duplicate_min,value__has_duplicate,value__sum_values,value__abs_energy,value__mean_abs_change,value__mean_change,value__mean_second_derivative_central,...,value__fourier_entropy__bins_5,value__fourier_entropy__bins_10,value__fourier_entropy__bins_100,value__permutation_entropy__dimension_3__tau_1,value__permutation_entropy__dimension_4__tau_1,value__permutation_entropy__dimension_5__tau_1,value__permutation_entropy__dimension_6__tau_1,value__permutation_entropy__dimension_7__tau_1,value__mean_n_absolute_max__number_of_maxima_7,class
0,AAME,0.0,0.0,0.0,1.0,1015.451589,1024.386465,0.076803,-0.000034,0.000069,...,1.158093,1.744168,3.751036,1.789879,3.164210,4.717093,6.136636,6.761558,1.428218,F
1,AB,0.0,0.0,0.0,1.0,1014.087827,1017.834843,0.049652,0.000028,-0.000042,...,1.366084,2.009976,3.919320,1.790240,3.166114,4.730625,6.142403,6.771012,1.242084,F
2,ABCB,0.0,0.0,0.0,1.0,1015.559692,1021.313235,0.054693,0.000064,-0.000028,...,1.272070,1.877135,3.769320,1.789880,3.166346,4.726192,6.151422,6.769298,1.262056,F
3,ACGL,0.0,0.0,0.0,1.0,1016.756814,1021.681310,0.032899,0.000013,-0.000007,...,1.106176,1.739233,3.723178,1.789426,3.169093,4.730159,6.181532,6.798909,1.177442,F
4,ACIC,0.0,0.0,0.0,1.0,1017.254546,1028.979806,0.063257,-0.000083,-0.000034,...,1.290751,1.870537,3.837543,1.697911,2.940149,4.283062,5.442282,5.926096,1.588024,F
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1129,WNS,0.0,0.0,0.0,1.0,1016.677550,1023.341170,0.049375,-0.000053,-0.000029,...,1.467615,2.091134,4.021655,1.783161,3.121704,4.595650,5.908430,6.476736,1.287408,T
1130,WOLF,0.0,0.0,0.0,1.0,1015.245140,1022.893308,0.078100,0.000202,0.000041,...,1.398047,2.078740,4.008463,1.785082,3.151472,4.707169,6.105756,6.753817,1.298448,T
1131,WYY,0.0,0.0,0.0,1.0,1016.426451,1029.245663,0.103375,-0.000002,-0.000019,...,1.199081,1.836859,3.819743,1.790312,3.171514,4.732598,6.202510,6.809923,1.419213,T
1132,XRX,0.0,0.0,0.0,1.0,1013.215625,1016.137552,0.051554,0.000031,-0.000020,...,1.358848,1.994244,3.935545,1.790815,3.171060,4.723483,6.143277,6.780649,1.223210,T
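
The two filtering steps above (dropping columns with NaN, then dropping constant columns) can be checked on a small made-up table. Column names here are hypothetical.

```python
import numpy as np
import pandas as pd

toy_features = pd.DataFrame(
    {
        "id": ["AAA", "BBB", "CCC"],
        "f_constant": [0.0, 0.0, 0.0],     # same value everywhere: no information
        "f_with_nan": [1.0, np.nan, 2.0],  # contains NaN: removed by dropna(axis=1)
        "f_useful": [0.1, 0.5, 0.3],
    }
)

filtered = toy_features.dropna(axis=1)
constant_cols = filtered.columns[filtered.nunique() == 1]
filtered = filtered.drop(columns=constant_cols)
print(filtered.columns.tolist())
```

Note that a label column holding a single class would also count as constant, so this filter is best applied to a table that contains more than one class.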


In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
# Prepare the dataset for machine learning: scale the numeric features and reduce them with PCA.
X = ts_features.drop('class', axis=1)
y = ts_features['class']
numeric_degerler = X.select_dtypes(include=['int64', 'float64']).columns
numeric_donusum = Pipeline(steps=[('scaler', StandardScaler()),('PCA', PCA(n_components=25))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_donusum, numeric_degerler)])

X_pca = preprocessor.fit_transform(X)
print(X_pca.shape)
y.shape

(1134, 25)


(1134,)

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.4, random_state=42)
# Train the model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

In [None]:
from sklearn.metrics import accuracy_score

y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Compute the test accuracy
test_accuracy = accuracy_score(y_test, y_test_pred)
print(f"Test accuracy: {test_accuracy}")

Test accuracy: 0.698237885462555


In [None]:
from sklearn.svm import SVC

model = SVC(kernel='rbf', random_state=42)
model.fit(X_train, y_train)
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Compute the test accuracy
test_accuracy = accuracy_score(y_test, y_test_pred)
print(f"Test accuracy: {test_accuracy}")

Test accuracy: 0.7400881057268722
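
For the model-selection step, one possible approach is to compare candidates with cross-validation instead of a single split. The sketch below runs on synthetic data from `make_classification`, not on the tsfresh features, and the number of PCA components is an arbitrary assumption. Scaling and PCA are placed inside the pipeline so that they are fitted on the training folds only.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the feature matrix (3 classes, made-up data).
X_toy, y_toy = make_classification(
    n_samples=300, n_features=40, n_informative=10, n_classes=3, random_state=0
)

candidates = {
    "random_forest": RandomForestClassifier(random_state=42),
    "svc_rbf": SVC(kernel="rbf", random_state=42),
}

cv_scores = {}
for name, clf in candidates.items():
    # Scaler and PCA are re-fitted on each training fold and never see
    # the held-out fold.
    pipe = make_pipeline(StandardScaler(), PCA(n_components=10), clf)
    cv_scores[name] = cross_val_score(pipe, X_toy, y_toy, cv=5).mean()

best_name = max(cv_scores, key=cv_scores.get)
print(cv_scores, best_name)
```

The scores on synthetic data say nothing about the sector problem; the point is only the structure of the comparison.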


In [None]:
symbols_energy = pd.read_csv('../content/energy.csv')['Symbol'].tolist()
data_energy = yf.download(symbols_energy, start='2005-01-01')
data_close_energy = data_energy['Open'].resample('W-MON').last().pct_change() + 1
data_close_energy = data_close_energy.drop(pd.Timestamp('2005-01-03'))
EnergyData=data_cleaning(data_close_energy)
NaN_count = [EnergyData.isna().sum().sum()]
print(NaN_count)
Column_Count=[EnergyData.shape[1]]
print(Column_Count)

[*********************100%%**********************]  254 of 254 completed
ERROR:yfinance:
1 Failed download:
ERROR:yfinance:['PBR.A']: YFTzMissingError('$%ticker%: possibly delisted; No timezone found')


[0]
[148]


In [None]:
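import numpy as np
# Boolean mask with one entry per column: True where the column holds +/-inf
inf_mask = np.isinf(EnergyData).any()
inf_columns = EnergyData.columns[inf_mask.to_numpy()]
# Drop those columns (the original notebook dropped 'VRN' by name)
EnergyData = EnergyData.drop(columns=inf_columns)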
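
A small made-up example shows the difference between counting the columns that contain infinite values and naming them. All column names and values below are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "AAA": [1.01, 0.99, 1.02],
        "BBB": [1.00, np.inf, 0.97],  # e.g. a return computed from a zero price
        "CCC": [0.98, 1.03, 1.00],
    }
)

inf_mask = np.isinf(toy).any()       # one boolean per column
n_inf_columns = int(inf_mask.sum())  # a count, not usable as a column index
inf_columns = inf_mask[inf_mask].index.tolist()

toy_clean = toy.drop(columns=inf_columns)
print(n_inf_columns, inf_columns)
```

Selecting by the boolean mask returns the column labels themselves, so nothing has to be hard-coded.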

In [None]:
features_energy = data_generating(EnergyData, 'E')



Ticker       Date        AE      AMTX       APA      ARLP      AROC       BKR  \
0      2005-01-10  0.977954  0.888889  0.968954  0.919094  1.005188  0.996429   
1      2005-01-17  1.040462  0.888889  1.053061  1.056191  1.005188  1.001195   
2      2005-01-24  1.050000  0.888889  1.009690  1.000139  1.005188  1.018854   
3      2005-01-31  1.122222  0.888889  1.033781  0.961944  1.005188  0.992738   
4      2005-02-07  1.138614  0.888889  1.023765  1.051112  1.005188  1.038697   
...           ...       ...       ...       ...       ...       ...       ...   
1008   2024-05-06  0.960362  1.062972  0.909287  1.080882  1.008479  0.975395   
1009   2024-05-13  1.028240  0.938389  1.029861  0.955782  1.004451  1.016194   
1010   2024-05-20  0.985916  0.964646  1.020758  1.014680  1.018710  1.026663   
1011   2024-05-27  0.981429  0.903141  0.959974  1.019290  0.946351  0.955224   
1012   2024-06-03  0.957060  1.031884  1.002354  1.032258  1.002554  1.008125   

Ticker      BOOM        BP 

Feature Extraction: 100%|██████████| 147/147 [00:42<00:00,  3.44it/s]


In [None]:
features_energy = features_energy.dropna(axis=1)
nunique = features_energy.nunique()
cols_to_drop = nunique[nunique == 1].index
features_energy = features_energy.drop(columns=cols_to_drop)
features_energy

Unnamed: 0,id,value__variance_larger_than_standard_deviation,value__has_duplicate,value__sum_values,value__abs_energy,value__mean_abs_change,value__mean_change,value__mean_second_derivative_central,value__median,value__mean,...,value__fourier_entropy__bins_3,value__fourier_entropy__bins_5,value__fourier_entropy__bins_10,value__fourier_entropy__bins_100,value__permutation_entropy__dimension_3__tau_1,value__permutation_entropy__dimension_4__tau_1,value__permutation_entropy__dimension_5__tau_1,value__permutation_entropy__dimension_6__tau_1,value__permutation_entropy__dimension_7__tau_1,value__mean_n_absolute_max__number_of_maxima_7
0,AE,0.0,1.0,1015.827031,1023.590159,0.072103,-0.000021,-4.296626e-05,0.999821,1.002791,...,0.925050,1.441503,2.067194,3.986970,1.790717,3.173109,4.740947,6.176535,6.779454,1.290477
1,AMTX,0.0,1.0,1016.914083,1060.286637,0.175362,0.000141,6.367094e-05,0.981595,1.003864,...,0.906568,1.137664,1.806356,3.731415,1.781316,3.117403,4.589766,5.885281,6.470439,2.214600
2,APA,0.0,1.0,1014.951290,1021.873137,0.066428,0.000033,-2.063686e-05,1.001805,1.001926,...,0.840729,1.332049,1.893007,3.827333,1.789405,3.170915,4.746143,6.212352,6.776701,1.358094
3,ARLP,0.0,1.0,1014.778357,1019.588104,0.056643,0.000112,-6.138928e-05,1.000000,1.001756,...,0.756329,1.146235,1.799977,3.795755,1.791260,3.172706,4.745282,6.176022,6.796156,1.215111
4,AROC,0.0,1.0,1014.658008,1020.817097,0.061244,-0.000003,2.779561e-05,1.005188,1.001637,...,0.864382,1.272713,1.948265,3.831514,1.746134,3.028233,4.445968,5.670165,6.158858,1.280588
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
142,WKC,0.0,1.0,1015.305522,1020.854307,0.057521,0.000050,-3.095509e-05,1.003556,1.002276,...,0.923396,1.399736,2.063781,4.013404,1.788446,3.159680,4.709702,6.137660,6.779974,1.273749
143,WMB,0.0,1.0,1015.487555,1020.666403,0.052432,0.000059,-3.405750e-05,1.003997,1.002456,...,0.930671,1.356849,1.992388,3.906927,1.791450,3.171288,4.730897,6.194649,6.793221,1.211863
144,WTI,0.0,1.0,1015.462950,1027.332245,0.097401,0.000012,2.981751e-05,1.000000,1.002431,...,0.981941,1.471783,2.127300,4.024687,1.790113,3.167755,4.726684,6.121625,6.761038,1.411249
145,XOM,0.0,1.0,1014.408175,1017.009208,0.034673,0.000020,6.545728e-07,1.002196,1.001390,...,1.055950,1.492199,2.168731,4.094644,1.790315,3.171944,4.733939,6.192778,6.793221,1.130888


In [None]:
from sklearn.ensemble import IsolationForest

# Drop the non-numeric column
if 'id' in features_energy.columns:
    features_energy_numeric = features_energy.drop('id', axis=1)
else:
    features_energy_numeric = features_energy

# Fit the model on the numeric features
iso_forest = IsolationForest(contamination=0.05)
iso_forest.fit(features_energy_numeric)

# Outlier predictions (-1 = outlier, 1 = normal)
outliers_pred = iso_forest.predict(features_energy_numeric)

# Collect the outliers in a DataFrame
features_energy['outlier'] = outliers_pred
outliers = features_energy[features_energy['outlier'] == -1]
print("Observations flagged as outliers:")
print(outliers)


Observations flagged as outliers:
       id  value__variance_larger_than_standard_deviation  \
13   CHRD                                             1.0   
43     EP                                             0.0   
99    PED                                             0.0   
100  PFIE                                             0.0   
104  RCON                                             0.0   
105   REI                                             0.0   
138  VIVK                                             0.0   
140  VTNR                                             0.0   

     value__has_duplicate  value__sum_values  value__abs_energy  \
13                    1.0        1248.947519       74087.825663   
43                    1.0        1037.672625        1124.102809   
99                    1.0        1038.292404        1264.333069   
100                   1.0         896.955307         846.634089   
104                   1.0        1106.233404        1262.605527   
105              
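
How `IsolationForest` labels points can be seen on synthetic data with one planted extreme point. This is a sketch on made-up numbers; `random_state` is fixed here only to make the example repeatable, which the cell above does not do.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# 99 made-up "ordinary" points plus one planted extreme point at the end.
X_toy = np.vstack([rng.normal(size=(99, 2)), [[50.0, 50.0]]])

forest = IsolationForest(contamination=0.05, random_state=0)
labels = forest.fit_predict(X_toy)  # -1 = outlier, 1 = inlier

outlier_rows = np.where(labels == -1)[0]
print(outlier_rows)
```

The `contamination` value sets the share of rows that get flagged, so with 0.05 roughly 5% of the rows are labelled -1 whether or not they are truly unusual.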



In [None]:
numeric_degerler = outliers.select_dtypes(include=['int64', 'float64']).columns
numeric_donusum = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('PCA', PCA(n_components=8))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_donusum, numeric_degerler)
    ])

features_energy_pca = preprocessor.fit_transform(outliers)
outliers

Unnamed: 0,id,value__variance_larger_than_standard_deviation,value__has_duplicate,value__sum_values,value__abs_energy,value__mean_abs_change,value__mean_change,value__mean_second_derivative_central,value__median,value__mean,...,value__fourier_entropy__bins_5,value__fourier_entropy__bins_10,value__fourier_entropy__bins_100,value__permutation_entropy__dimension_3__tau_1,value__permutation_entropy__dimension_4__tau_1,value__permutation_entropy__dimension_5__tau_1,value__permutation_entropy__dimension_6__tau_1,value__permutation_entropy__dimension_7__tau_1,value__mean_n_absolute_max__number_of_maxima_7,outlier
13,CHRD,1.0,1.0,1248.947519,74087.825663,0.603274,0.000146,2e-05,0.968623,1.23292,...,0.090729,0.589753,2.215733,1.647308,2.769964,3.95868,4.93231,5.272641,40.173723,-1
43,EP,0.0,1.0,1037.672625,1124.102809,0.236328,0.000109,-1.8e-05,1.0,1.024356,...,1.487798,2.140704,4.004219,1.78082,3.126862,4.611435,5.956014,6.647091,2.634269,-1
99,PED,0.0,1.0,1038.292404,1264.333069,0.231263,0.000211,4.1e-05,0.994118,1.024968,...,1.118096,1.77656,3.735696,1.787414,3.159722,4.704585,6.10147,6.763272,4.685639,-1
100,PFIE,0.0,1.0,896.955307,846.634089,0.078062,0.000458,-3.9e-05,0.966102,0.885445,...,0.090729,0.136002,1.379245,1.630478,2.754877,3.927987,4.895377,5.276483,1.684606,-1
104,RCON,0.0,1.0,1106.233404,1262.605527,0.111922,-0.00045,-2.9e-05,1.020408,1.092037,...,0.090729,0.720896,2.77242,1.691139,2.886066,4.126492,5.193982,5.575811,2.403299,-1
105,REI,0.0,1.0,1065.036296,1149.395279,0.082907,-0.000357,3.5e-05,1.0,1.051369,...,0.090729,0.136002,1.695235,1.631079,2.800004,4.038732,5.057237,5.518864,1.807686,-1
138,VIVK,0.0,1.0,1031.852303,1165.458568,0.195524,0.000101,0.000103,1.0,1.01861,...,1.312375,1.919303,3.904517,1.615907,2.744799,3.921704,4.914858,5.330364,3.441769,-1
140,VTNR,0.0,1.0,1030.02597,1115.67942,0.186492,0.000157,-1.6e-05,0.995425,1.016807,...,1.12023,1.719573,3.693821,1.791136,3.169681,4.724845,6.15266,6.78548,3.217572,-1


In [None]:
outliers["id"]

13     CHRD
43       EP
99      PED
100    PFIE
104    RCON
105     REI
138    VIVK
140    VTNR
Name: id, dtype: object

In [None]:
y_pred_energy = model.predict(features_energy_pca)
y_pred_energy
# 4 of the 8 outlier energy firms (50%) are classified as technology-like.

array(['T', 'F', 'F', 'T', 'F', 'T', 'H', 'T'], dtype=object)
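
When scoring symbols from a new sector, the new feature rows generally need to pass through the scaler and PCA that were fitted on the training sectors, with the same feature columns, so that they land in the space the classifier was trained on. The following is a minimal sketch of that pattern on made-up data; every name in it is hypothetical and the component count is arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Made-up training features (two classes) and a few "new sector" rows.
X_train_toy = np.vstack([rng.normal(0, 1, (60, 12)), rng.normal(3, 1, (60, 12))])
y_train_toy = np.array(["F"] * 60 + ["T"] * 60)
X_new_toy = rng.normal(3, 1, (5, 12))

# Scaler, PCA and classifier are fitted once, on the training data only.
clf = make_pipeline(StandardScaler(), PCA(n_components=4), SVC(kernel="rbf"))
clf.fit(X_train_toy, y_train_toy)

# New rows go through the same fitted transforms; nothing is re-fitted here.
new_pred = clf.predict(X_new_toy)
print(new_pred)
```

Keeping all three steps in one pipeline object makes it hard to apply a differently fitted transform by accident.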

In [None]:
id_pred_dict = dict(zip(outliers["id"], y_pred_energy))
id_pred_dict

{'CHRD': 'T',
 'EP': 'F',
 'PED': 'F',
 'PFIE': 'T',
 'RCON': 'F',
 'REI': 'T',
 'VIVK': 'H',
 'VTNR': 'T'}
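
The majority question from the bonus step can be answered by counting the predicted labels. This sketch copies the dictionary shown above and uses only the standard library.

```python
from collections import Counter

# Predictions copied from the dictionary shown above.
predictions = {
    "CHRD": "T", "EP": "F", "PED": "F", "PFIE": "T",
    "RCON": "F", "REI": "T", "VIVK": "H", "VTNR": "T",
}

counts = Counter(predictions.values())
majority_label, majority_count = counts.most_common(1)[0]
share = majority_count / len(predictions)
print(majority_label, share)
```

With only 8 symbols, the margin between the most common label and the runner-up is a single symbol, so the result should be read with caution.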