<a href="https://colab.research.google.com/github/KarolinaK-14/ML/blob/main/feature_extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### scikit-learn
Library page: [https://scikit-learn.org](https://scikit-learn.org)

Documentation/User Guide: [https://scikit-learn.org/stable/user_guide.html](https://scikit-learn.org/stable/user_guide.html)

scikit-learn is a core Python library for machine learning.

To install the library, use the command below:
```
!pip install scikit-learn
```
To update the library to the latest version, use the command below:
```
!pip install --upgrade scikit-learn
```
### Table of contents:
1. [Importing libraries](#0)
2. [Loading data](#1)
3. [Creating a copy of the data](#2)
4. [Generating new variables](#3)
5. [Discretization of a continuous variable](#4)
6. [Feature extraction](#5)


### <a name='0'></a> Importing libraries

In [1]:
import numpy as np
import pandas as pd
import sklearn

### <a name='1'></a> Loading data

In [2]:
# def fetch_financial_data(company='AMZN'):
#     """
#     This function fetches stock market quotations.
#     """
#     import pandas_datareader.data as web
#     return web.DataReader(name=company, data_source='stooq')

# df_raw = fetch_financial_data()
# df_raw.head()

import pandas_datareader

In [4]:
df_raw = pandas_datareader.data.DataReader(name='AMZN', data_source='stooq')
df_raw.head()

Date,Open,High,Low,Close,Volume
2025-07-03,221.82,224.01,221.36,223.41,29632353
2025-07-02,219.73,221.6,219.06,219.92,30894178
2025-07-01,219.5,221.875,217.93,220.46,39256830
2025-06-30,223.52,223.82,219.12,219.39,58887780
2025-06-27,219.92,223.3,216.74,223.3,119217138


### <a name='2'></a> Creating a copy of the data

In [6]:
df = df_raw.copy()
df = df[:5]
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 5 entries, 2025-07-03 to 2025-06-27
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Open    5 non-null      float64
 1   High    5 non-null      float64
 2   Low     5 non-null      float64
 3   Close   5 non-null      float64
 4   Volume  5 non-null      int64  
dtypes: float64(4), int64(1)
memory usage: 240.0 bytes
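
The reason for working on a copy can be shown with a small sketch. The frame below is a stand-in (not the downloaded data) that reuses three `Close` values from the output above; the names `raw` and `work` are illustrative only.

```python
import pandas as pd

# Small stand-in frame; the values are taken from the Close column shown above.
raw = pd.DataFrame({'Close': [223.41, 219.92, 220.46]})

work = raw.copy()                         # copy() is deep by default
work['Close'] = work['Close'].round(0)    # change only the working copy

print(raw['Close'].tolist())    # the original values are untouched
print(work['Close'].tolist())   # the copy holds the rounded values
```

Because `copy()` duplicates the data, later feature engineering on the working frame leaves the raw download available for comparison.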


### <a name='3'></a> Generating new variables

In [7]:
df.index.month

Index([7, 7, 7, 6, 6], dtype='int32', name='Date')

In [8]:
df['day'] = df.index.day
df['month'] = df.index.month
df['year'] = df.index.year
df

Date,Open,High,Low,Close,Volume,day,month,year
2025-07-03,221.82,224.01,221.36,223.41,29632353,3,7,2025
2025-07-02,219.73,221.6,219.06,219.92,30894178,2,7,2025
2025-07-01,219.5,221.875,217.93,220.46,39256830,1,7,2025
2025-06-30,223.52,223.82,219.12,219.39,58887780,30,6,2025
2025-06-27,219.92,223.3,216.74,223.3,119217138,27,6,2025
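
A `DatetimeIndex` exposes more attributes than `day`, `month` and `year`. The sketch below is one possible extension, not part of the original notebook: it rebuilds the index from the five dates shown above (so it runs without network access) and derives three further calendar features. The frame name `features` is an assumption made for illustration.

```python
import pandas as pd

# Dates copied from the output above, so no network access is needed.
idx = pd.DatetimeIndex(
    ['2025-07-03', '2025-07-02', '2025-07-01', '2025-06-30', '2025-06-27'],
    name='Date',
)

features = pd.DataFrame(index=idx)
features['day_of_week'] = idx.dayofweek       # Monday=0 ... Sunday=6
features['quarter'] = idx.quarter             # calendar quarter, 1 to 4
features['is_month_end'] = idx.is_month_end   # True on the last calendar day
print(features)
```

Which of these attributes are useful depends on the model; they are shown here only as further examples of the same pattern.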


### <a name='4'></a> Discretization of a continuous variable

In [9]:
df = pd.DataFrame(data={'height': [175., 178.5, 185., 191., 184.5, 183., 168.]})
df

,height
0,175.0
1,178.5
2,185.0
3,191.0
4,184.5
5,183.0
6,168.0


In [11]:
# df['height_cat'] = pd.cut(x=df.height, bins=(160, 175, 190, 260), labels=['short', 'tall', 'very tall'])
# df

df['height_cat'] = pd.cut(x=df.height, bins=3)
df

,height,height_cat
0,175.0,"(167.977, 175.667]"
1,178.5,"(175.667, 183.333]"
2,185.0,"(183.333, 191.0]"
3,191.0,"(183.333, 191.0]"
4,184.5,"(183.333, 191.0]"
5,183.0,"(175.667, 183.333]"
6,168.0,"(167.977, 175.667]"
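
With an integer `bins` argument, `pd.cut` splits the range between the minimum and maximum into equal-width intervals. The output above shows the lowest edge at 167.977 rather than 168.0: the first edge is pushed slightly below the minimum so that the smallest value falls inside the first interval. A minimal sketch using `retbins=True` to inspect the computed edges (the names `heights`, `cats`, `edges` and `width` are illustrative):

```python
import pandas as pd

heights = pd.Series([175., 178.5, 185., 191., 184.5, 183., 168.])

# retbins=True also returns the bin edges that pd.cut computed.
cats, edges = pd.cut(heights, bins=3, retbins=True)

width = (heights.max() - heights.min()) / 3
print(width)   # width of each of the three equal bins
print(edges)   # the first edge sits slightly below the minimum
```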


In [12]:
df.height_cat = pd.cut(x=df.height, bins=(160, 170, 180, 200))
df

,height,height_cat
0,175.0,"(170, 180]"
1,178.5,"(170, 180]"
2,185.0,"(180, 200]"
3,191.0,"(180, 200]"
4,184.5,"(180, 200]"
5,183.0,"(180, 200]"
6,168.0,"(160, 170]"
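
The intervals above are closed on the right, which is the default of `pd.cut`. The following sketch shows how a value lying exactly on an edge is assigned, and how `right=False` changes that. The value 180.0 is hypothetical and was chosen only because it sits on a bin edge.

```python
import pandas as pd

# 180.0 is a hypothetical value chosen to sit exactly on a bin edge.
s = pd.Series([170.0, 180.0])
edges = (160, 170, 180, 200)

right_closed = pd.cut(s, bins=edges)               # default: intervals (a, b]
left_closed = pd.cut(s, bins=edges, right=False)   # intervals [a, b)

print(right_closed)   # 180.0 falls into the bin that ends at 180
print(left_closed)    # 180.0 falls into the bin that starts at 180
```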


In [13]:
df.height_cat = pd.cut(x=df.height, bins=(160, 170, 180, 190, 200), labels=['small', 'medium', 'tall', 'very tall'])
df

,height,height_cat
0,175.0,medium
1,178.5,medium
2,185.0,tall
3,191.0,very tall
4,184.5,tall
5,183.0,tall
6,168.0,small
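
The labelled result of `pd.cut` is an ordered categorical, so the labels carry the order of the bins. A short sketch, under the assumption that an integer (ordinal) encoding is wanted, reads that order through the `.cat` accessor; the standalone series `heights` repeats the values used above.

```python
import pandas as pd

heights = pd.Series([175., 178.5, 185., 191., 184.5, 183., 168.])
labels = ['small', 'medium', 'tall', 'very tall']
height_cat = pd.cut(heights, bins=(160, 170, 180, 190, 200), labels=labels)

print(height_cat.cat.ordered)          # the labels keep the order of the bins
print(height_cat.cat.codes.tolist())   # integer position of each label
print(height_cat.value_counts())       # number of observations per label
```

The codes can serve as a compact ordinal feature when the order small < medium < tall < very tall is meaningful to the model.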


In [16]:
pd.get_dummies(df, drop_first=True, prefix='height', dtype='int')

,height,height_medium,height_tall,height_very tall
0,175.0,1,0,0
1,178.5,1,0,0
2,185.0,0,1,0
3,191.0,0,0,1
4,184.5,0,1,0
5,183.0,0,1,0
6,168.0,0,0,0
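
In the output above the `small` category has no column of its own: `drop_first=True` removes the first level, and a row with zeros in all remaining dummy columns (row 6) belongs to that dropped level. The sketch below compares both variants on the same data; the names `df_h`, `full` and `reduced` are illustrative.

```python
import pandas as pd

df_h = pd.DataFrame({'height': [175., 178.5, 185., 191., 184.5, 183., 168.]})
df_h['height_cat'] = pd.cut(
    df_h.height,
    bins=(160, 170, 180, 190, 200),
    labels=['small', 'medium', 'tall', 'very tall'],
)

full = pd.get_dummies(df_h, prefix='height', dtype=int)
reduced = pd.get_dummies(df_h, prefix='height', drop_first=True, dtype=int)

print(list(full.columns))      # one column per category
print(list(reduced.columns))   # the first category ('small') is dropped

# Rows where every remaining dummy is 0 belong to the dropped category.
print(reduced[reduced.filter(like='height_').sum(axis=1) == 0])
```

Dropping one level avoids perfectly collinear dummy columns, which matters mainly for linear models.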


### <a name='5'></a> Feature extraction

In [17]:
df = pd.DataFrame(data={'lang': [['PL', 'ENG'], ['GER', 'ENG', 'PL', 'FRA'], ['RUS']]})
df

,lang
0,"[PL, ENG]"
1,"[GER, ENG, PL, FRA]"
2,[RUS]


In [18]:
df['lang_number'] = df.lang.apply(len)
df

,lang,lang_number
0,"[PL, ENG]",2
1,"[GER, ENG, PL, FRA]",4
2,[RUS],1


In [20]:
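# Note: the PL_lang column in the output below was presumably created by an
# earlier cell that is not shown here; this cell only adds PL_flag.
df['PL_flag'] = df.lang.apply(lambda x: 1 if 'PL' in x else 0)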
df

,lang,lang_number,PL_lang,PL_flag
0,"[PL, ENG]",2,1,1
1,"[GER, ENG, PL, FRA]",4,1,1
2,[RUS],1,0,0


In [22]:
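# PL_lang is not created by any cell shown above, so a plain `del` raises
# KeyError on a fresh top-to-bottom run; drop the column only if it exists.
df = df.drop(columns='PL_lang', errors='ignore')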
df

,lang,lang_number,PL_flag
0,"[PL, ENG]",2,1
1,"[GER, ENG, PL, FRA]",4,1
2,[RUS],1,0
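
The `PL_flag` column covers a single language. One possible way to get a flag for every language at once, sketched here with pandas only, is to explode the lists into one row per language, one-hot encode them, and collapse back to one row per original index. The names `df_lang`, `pairs` and `flags` are illustrative and not part of the original notebook.

```python
import pandas as pd

df_lang = pd.DataFrame({'lang': [['PL', 'ENG'], ['GER', 'ENG', 'PL', 'FRA'], ['RUS']]})

# explode() gives one row per (row, language) pair and keeps the original index.
pairs = df_lang['lang'].explode()

# One-hot encode each pair, then collapse back to one row per original index.
flags = pd.get_dummies(pairs, dtype=int).groupby(level=0).max()
print(flags)
```

The row sums of `flags` reproduce the `lang_number` column, and the `PL` column reproduces `PL_flag`.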


In [23]:
df = pd.DataFrame(data={'website': ['wp.pl', 'onet.pl', 'google.com']})
df

,website
0,wp.pl
1,onet.pl
2,google.com


In [28]:
new = df.website.str.split('.', expand=True)
new

,0,1
0,wp,pl
1,onet,pl
2,google,com


In [29]:
df['portal'] = new[0]
df['extension'] = new[1]
df

,website,portal,extension
0,wp.pl,wp,pl
1,onet.pl,onet,pl
2,google.com,google,com
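
Splitting on every dot works for the three addresses above because each contains exactly one dot. For an address with more dots, the same split would produce a third column. A hedged alternative is to split only at the last dot with `str.rsplit` and `n=1`; the address `news.google.com` below is hypothetical and added only to illustrate the case.

```python
import pandas as pd

# 'news.google.com' is a hypothetical address added to show a name with two dots.
sites = pd.Series(['wp.pl', 'onet.pl', 'news.google.com'])

# n=1 limits the split to the last dot, so the result always has two columns.
parts = sites.str.rsplit('.', n=1, expand=True)
parts.columns = ['portal', 'extension']
print(parts)
```

Whether the subdomain should stay in `portal` or be separated further is a modelling decision that this sketch does not settle.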


In [30]:
new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       3 non-null      object
 1   1       3 non-null      object
dtypes: object(2)
memory usage: 180.0+ bytes
