#### This Jupyter Notebook covers data retrieval and the use of the tsfresh library.
#### Its purpose is to download the data and prepare it for the machine learning steps that follow.

<font color='#76ABAE'>The code below makes IPython display every result in every cell, not just the last one.</font>

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

In [2]:
from io import StringIO

import pandas as pd
import requests
import yfinance
from bs4 import BeautifulSoup
from tsfresh import extract_features
from tsfresh.feature_extraction.settings import MinimalFCParameters
from tsfresh.utilities.dataframe_functions import impute


def _fetch_first_table(url):
    """Download `url` and return its first HTML table as a DataFrame."""
    response = requests.get(url)
    if response.status_code != 200:
        # Fail loudly: without a table there is nothing meaningful to return.
        raise RuntimeError(f"Error: Failed to fetch data from page {url}")
    soup = BeautifulSoup(response.content, "html.parser")
    # pandas deprecates passing literal HTML to read_html, so wrap it in StringIO.
    return pd.read_html(StringIO(str(soup.find_all("table"))))[0]


def fetch_sectors_names():
    # First table on the sectors overview page.
    return _fetch_first_table("https://stockanalysis.com/stocks/industry/sectors/")


def fetch_industry_names():
    # First table on the industries overview page.
    return _fetch_first_table("https://stockanalysis.com/stocks/industry/all/")


def fetch_data(sectors):
    # Table of the companies listed under one sector, e.g. sectors='technology'.
    df = _fetch_first_table(f"https://stockanalysis.com/stocks/sector/{sectors}/")
    # Drop the site's running row-number column.
    return df.drop(columns='No.')

In [3]:
sectors=fetch_sectors_names()



In [4]:
sectors

Unnamed: 0,Sector Name,Stocks,Market Cap,Div. Yield,PE Ratio,Profit Margin,1D Change,1Y Change
0,Financials,1386,"9,699.10B",2.41%,14.75,17.74%,0.52%,11.58%
1,Healthcare,1216,"8,176.17B",0.43%,50.90,4.08%,1.32%,7.51%
2,Technology,787,17.83T,0.43%,45.13,13.17%,0.28%,15.00%
3,Industrials,651,"5,473.20B",1.11%,26.67,7.33%,0.55%,17.68%
4,Consumer Discretionary,577,"7,330.14B",0.65%,27.61,5.94%,0.31%,2.23%
5,Materials,263,"2,067.96B",1.56%,19.62,8.67%,0.45%,2.24%
6,Real Estate,261,"1,507.63B",4.11%,50.72,8.89%,-0.14%,6.94%
7,Communication Services,260,"5,389.33B",1.09%,28.16,10.40%,0.29%,2.06%
8,Energy,253,"3,643.44B",2.85%,8.01,12.42%,0.43%,18.22%
9,Consumer Staples,241,"4,039.77B",1.45%,29.77,4.72%,0.47%,12.18%


## Accessing the sector lists

<font color='#76ABAE'>The functions defined above tell us which symbols belong to which sectors, and the scripts below store that information in `.csv` files. The data for those symbols can then be downloaded and classified by sector.</font>

In [5]:
fetch_data(sectors='financials').to_csv('./data/stock_sectors/financials.csv')
fetch_data(sectors='healthcare').to_csv('./data/stock_sectors/healthcare.csv')
fetch_data(sectors='technology').to_csv('./data/stock_sectors/technology.csv')
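
The three calls above could be generalized to every row of the sector overview table. The following is only a sketch under assumptions: `sector_slug` and `sector_csv_path` are hypothetical helpers, and the slug rule (lower-case the name, replace spaces with hyphens) is a guess that the notebook only confirms for single-word names such as `financials`.

```python
import re


def sector_slug(sector_name):
    """Hypothetical helper: turn a display name into a URL/file-name slug.

    Assumption: a slug is the lower-cased name with every run of
    non-alphanumeric characters collapsed into a single hyphen.
    """
    return re.sub(r"[^a-z0-9]+", "-", sector_name.lower()).strip("-")


def sector_csv_path(sector_name, root="./data/stock_sectors"):
    """Build the CSV path under which one sector's company table is stored."""
    return f"{root}/{sector_slug(sector_name)}.csv"


for name in ["Financials", "Healthcare", "Consumer Discretionary"]:
    print(name, "->", sector_csv_path(name))
```

With such a helper, a loop over the `Sector Name` column could call `fetch_data(sectors=sector_slug(name))` for each sector; whether every generated slug actually resolves on the site would still need to be checked.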



### <font>Read the sector data</font>

In [6]:
sectors = {
    'finance': pd.read_csv('./data/stock_sectors/financials.csv'),
    'healthcare': pd.read_csv('./data/stock_sectors/healthcare.csv'),
    'technology': pd.read_csv('./data/stock_sectors/technology.csv')
}

### Let's inspect the NaN values in our Symbol column.
<font color='#76ABAE'>In the next step we cannot fetch prices for a company whose Symbol is NaN, so these values have to be fixed first.</font>


In [7]:
for sector, data in sectors.items():
    print(f"{sector.capitalize()} NaN Count:", data['Symbol'].isna().sum())

Finance NaN Count: 0
Healthcare NaN Count: 0
Technology NaN Count: 1


<font color='#76ABAE'>Only the technology data contains a NaN value.</font>

In [8]:
sectors['technology'][sectors['technology']['Symbol'].isna()]

Unnamed: 0.1,Unnamed: 0,Symbol,Company Name,Market Cap,% Change,Volume,Revenue
543,543,,Nano Labs Ltd,136.10M,0.77%,14978,655.30M


<font color='#76ABAE'>After researching the company "Nano Labs Ltd", whose symbol is NaN above, I found that its "Symbol" value should be "NA".</font>
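
A likely reason for this NaN is that pandas treats the literal string `NA` as a missing-value marker when it parses text. The sketch below reproduces the effect on a toy CSV (the second row is made up for illustration) and shows one possible way around it, `keep_default_na=False`. This is an alternative to filling the value afterwards, not what the notebook itself does.

```python
from io import StringIO

import pandas as pd

# Toy CSV standing in for technology.csv; "XYZ, Example Corp" is a made-up row.
csv_text = "Symbol,Company Name\nNA,Nano Labs Ltd\nXYZ,Example Corp\n"

# Default parsing: the string "NA" is on pandas' list of missing-value markers.
default_df = pd.read_csv(StringIO(csv_text))
print(default_df["Symbol"].isna().sum())

# keep_default_na=False disables that list, so "NA" is kept as text.
# Caveat: empty cells are then no longer converted to NaN either.
kept_df = pd.read_csv(StringIO(csv_text), keep_default_na=False)
print(kept_df["Symbol"].tolist())
```

The trade-off is that every other column also loses its automatic missing-value detection, so filling the single affected symbol by hand, as done in the next cell, remains a reasonable choice.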

In [9]:
# Fill in the NaN values
sectors['technology']['Symbol'] = sectors['technology']['Symbol'].fillna("NA")


In [10]:
# Fetch the stock prices and save them to CSV
for sector, data in sectors.items():
    print(f"Fetching {sector.capitalize()} data...")
    symbols = data.dropna()['Symbol'].tolist()
    # Month-end adjusted close, converted to gross monthly returns (1 + % change)
    stock_data = yfinance.download(symbols, start='2005-01-01')['Adj Close'].resample('ME').last().pct_change()+1
    stock_data.to_csv(f'./data/stock_values/{sector}.csv')
    print(f"{sector.capitalize()} data saved to CSV.")

Fetching Finance data...


[*********************100%%**********************]  998 of 998 completed

8 Failed downloads:
['BNRE.A', 'BRK.B', 'CRD.A', 'CRD.B', 'LEGT', 'DISA', 'AGM.A', 'DYCQ']: Exception('%ticker%: No timezone found, symbol may be delisted')


Finance data saved to CSV.
Fetching Healthcare data...


[*********************100%%**********************]  1217 of 1217 completed

1 Failed download:
['BIO.B']: Exception('%ticker%: No price data found, symbol may be delisted (1d 2005-01-01 -> 2024-03-16)')


Healthcare data saved to CSV.
Fetching Technology data...


[*********************100%%**********************]  787 of 787 completed


Technology data saved to CSV.


In [11]:
def extract_tsfresh_features(dataframe):
    # Transpose the DataFrame so that each row is one ticker
    transposed_df = dataframe.transpose()
    
    # Use the first row (the dates) as the column names
    transposed_df.columns = transposed_df.iloc[0]
    transposed_df = transposed_df[1:]
    transposed_df = transposed_df.rename_axis(columns='date')
    # The transpose leaves object dtype: convert to float, then replace missing returns with 0
    transposed_df = transposed_df.astype(float).fillna(0)
    
    # Extract features with MinimalFCParameters
    param = MinimalFCParameters()
    
    # Apply tsfresh to each column, i.e. one date at a time
    feature_frames = []
    for column in transposed_df.columns:
        ts = transposed_df[column].reset_index()
        ts.columns = ["id", "value"]
        ts["time"] = column
        features = extract_features(ts, column_id="id", column_sort="time", default_fc_parameters=param)
        feature_frames.append(features)
    # Concatenate once at the end instead of growing a DataFrame inside the loop
    extracted_features = pd.concat(feature_frames, axis=1)
    
    impute(extracted_features)
    
    return extracted_features
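
The function above calls `extract_features` once per date, so every ticker contributes a single value per call. tsfresh can also read a long-format table in which each ticker is one `id` and the dates provide the sort order. The sketch below shows that reshaping on invented data; `extract_per_ticker` is a hypothetical helper and an alternative arrangement, not the notebook's method.

```python
import pandas as pd

# Toy stand-in for a stock_values CSV: a Date column plus one column per ticker.
wide = pd.DataFrame({
    "Date": ["2024-01-31", "2024-02-29", "2024-03-31"],
    "AAA": [1.00, 1.10, 0.95],   # made-up gross returns
    "BBB": [1.00, None, 1.05],
})

# Long format: one row per (ticker, date) pair, the layout that
# extract_features reads through column_id / column_sort / column_value.
long_df = (
    wide.melt(id_vars="Date", var_name="id", value_name="value")
        .rename(columns={"Date": "time"})
        .fillna({"value": 0})
        .sort_values(["id", "time"])
        .reset_index(drop=True)
)
print(long_df)


def extract_per_ticker(long_frame):
    """Sketch only: one feature row per ticker, computed over its whole history."""
    # Imported lazily so the reshaping above runs without tsfresh installed.
    from tsfresh import extract_features
    from tsfresh.feature_extraction.settings import MinimalFCParameters

    return extract_features(
        long_frame,
        column_id="id",
        column_sort="time",
        column_value="value",
        default_fc_parameters=MinimalFCParameters(),
    )
```

Calling `extract_per_ticker(long_df)` would then need only one extraction pass per sector, with features such as the mean or standard deviation summarizing each ticker's full return history.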

In [12]:
# Load the data
data_fin = pd.read_csv("./data/stock_values/finance.csv")
data_health = pd.read_csv('./data/stock_values/healthcare.csv')
data_tech = pd.read_csv('./data/stock_values/technology.csv')

In [13]:
# Run the tsfresh feature extraction
extracted_features_fin = extract_tsfresh_features(data_fin)
extracted_features_health = extract_tsfresh_features(data_health)
extracted_features_tech = extract_tsfresh_features(data_tech)

Feature Extraction: 100%|██████████| 30/30 [00:04<00:00,  6.19it/s]
Feature Extraction: 100%|██████████| 30/30 [00:02<00:00, 11.38it/s]
Feature Extraction: 100%|██████████| 30/30 [00:02<00:00, 12.11it/s]
Feature Extraction: 100%|██████████| 30/30 [00:02<00:00, 12.85it/s]
Feature Extraction: 100%|██████████| 30/30 [00:02<00:00, 13.16it/s]
Feature Extraction: 100%|██████████| 30/30 [00:02<00:00, 11.99it/s]
Feature Extraction: 100%|██████████| 30/30 [00:02<00:00, 13.23it/s]
Feature Extraction: 100%|██████████| 30/30 [00:02<00:00, 11.42it/s]
Feature Extraction: 100%|██████████| 30/30 [00:02<00:00, 11.00it/s]
Feature Extraction: 100%|██████████| 30/30 [00:02<00:00, 12.73it/s]
Feature Extraction: 100%|██████████| 30/30 [00:02<00:00, 12.91it/s]
Feature Extraction: 100%|██████████| 30/30 [00:02<00:00, 11.25it/s]
Feature Extraction: 100%|██████████| 30/30 [00:02<00:00, 11.29it/s]
Feature Extraction: 100%|██████████| 30/30 [00:02<00:00, 11.32it/s]

# Saving the results

In [15]:
extracted_features_fin['label'] = 'f'
extracted_features_health['label'] = 'h'
extracted_features_tech['label'] = 't'

extracted_features_fin.to_csv("./data/extracted_data/fin_min_param.csv")
extracted_features_health.to_csv("./data/extracted_data/health_min_param.csv")
extracted_features_tech.to_csv("./data/extracted_data/tech_min_param.csv")

# Combine the data
general_data = pd.concat([extracted_features_fin,
                          extracted_features_health,
                          extracted_features_tech], ignore_index=True)

# Save the combined data
general_data.to_csv("./data/extracted_data/general_data.csv", index=False)