# **Data Collection**

## Objectives

* Fetch data from Yahoo Finance and save it as raw data
* Inspect the data and save it under outputs/datasets/collection

## Inputs

* Daily price history for Phoenix Group Holdings plc (ticker PHNX.L) from Yahoo Finance, 2015-01-01 to 2025-01-01, fetched with yfinance

## Outputs

* Generate Dataset: outputs/datasets/collection/phnx_2015_2025.csv

## Additional Comments

* I converted all the feature names to lowercase for ease of use 


---

# Change working directory

* We are assuming you will store the notebooks in a subfolder, therefore when running the notebook in the editor, you will need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\ajv10\\Documents\\vscode-projects\\PP5\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\ajv10\\Documents\\vscode-projects\\PP5'
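Re-running the cells above can move the working directory up more than one level, because each run takes the parent of whatever `current_dir` holds at that point. A minimal sketch of a guard, assuming the notebooks folder is always named `jupyter_notebooks` as in the path shown above; the helper name `project_root` is hypothetical:

```python
import os
from pathlib import Path


def project_root(cwd: Path, notebooks_dir: str = "jupyter_notebooks") -> Path:
    """Return the parent of cwd only when cwd is the notebooks folder."""
    # Assumption: the notebooks live in a folder with this exact name.
    return cwd.parent if cwd.name == notebooks_dir else cwd


# Safe to run repeatedly: once at the project root, the directory no longer changes.
root = project_root(Path.cwd())
os.chdir(root)
print(root)
```

With this guard the cell can be re-executed without leaving the project folder.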

# Fetch and save raw data

Install yfinance

In [4]:
%pip install yfinance==0.2.56

Note: you may need to restart the kernel to use updated packages.


Import yfinance and fetch the price history for Phoenix Group Holdings plc (ticker PHNX.L) from 2015-01-01 up to, but not including, 2025-01-01

In [5]:
import yfinance as yf
ticker = yf.Ticker("PHNX.L")

data = ticker.history(start="2015-01-01", end="2025-01-01")
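A failed download can come back as an empty DataFrame rather than an exception, so it may be worth checking the result before saving it. The following is a sketch only: `check_history` and `REQUIRED_COLUMNS` are hypothetical names, and the sample frame reuses the first row shown later in this notebook.

```python
import pandas as pd

# Assumption: these are the price columns the rest of the project relies on.
REQUIRED_COLUMNS = ("Open", "High", "Low", "Close", "Volume")


def check_history(history: pd.DataFrame, required=REQUIRED_COLUMNS) -> None:
    """Raise ValueError if the fetched history is empty or lacks a required column."""
    if history.empty:
        raise ValueError("no rows returned; check the ticker symbol and date range")
    missing = [name for name in required if name not in history.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")


# Small stand-in for the frame returned by ticker.history(...)
sample = pd.DataFrame(
    {"Open": [629.093287], "High": [630.999626], "Low": [621.269665],
     "Close": [624.899292], "Volume": [385199]}
)
check_history(sample)  # passes silently
```

In the notebook this would be called as `check_history(data)` right after the fetch.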

Create input directory for datasets

In [6]:
dir = "inputs/datasets/raw"

try:
  # create the folder, including any missing parent folders
  os.makedirs(name=dir)
  print(f"directory {dir} created")
except Exception as e:
  print(e)

directory inputs/datasets/raw created
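On a second run the cell above prints the "file exists" error, because the folder is already there. One possible alternative, sketched here with a hypothetical helper `ensure_dir`, is to pass `exist_ok=True`: an existing directory is then accepted silently, while other problems such as permission errors still raise. The demo uses a temporary folder so it leaves nothing behind.

```python
import os
import tempfile


def ensure_dir(path: str) -> str:
    """Create path (and any missing parents); do nothing if it already exists."""
    os.makedirs(path, exist_ok=True)
    return path


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "inputs", "datasets", "raw")
    ensure_dir(target)
    ensure_dir(target)  # second call is a no-op instead of an error
    created = os.path.isdir(target)

print(created)
```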


Save the CSV file to the input directory

In [7]:
data.to_csv(f"{current_dir}/{dir}/phnx_2015_2025.csv")
print("csv file saved")

csv file saved


# Load and inspect the data

Import pandas and load the CSV file from the inputs directory into a DataFrame

In [8]:
import pandas as pd
df = pd.read_csv(f"{dir}/phnx_2015_2025.csv")
print(df.shape)
df

(2526, 8)


Unnamed: 0,Date,Open,High,Low,Close,Volume,Dividends,Stock Splits
0,2015-01-02 00:00:00+00:00,629.093287,630.999626,621.269665,624.899292,385199,0.0,0.0
1,2015-01-05 00:00:00+00:00,624.517926,628.330604,611.204044,612.317383,416491,0.0,0.0
2,2015-01-06 00:00:00+00:00,616.130071,616.130071,597.447876,597.447876,781658,0.0,0.0
3,2015-01-07 00:00:00+00:00,603.548219,605.812932,598.286709,604.310791,743617,0.0,0.0
4,2015-01-08 00:00:00+00:00,607.742163,612.317413,605.454508,611.936157,453541,0.0,0.0
...,...,...,...,...,...,...,...,...
2521,2024-12-23 00:00:00+00:00,498.162306,502.260357,496.762980,501.260834,1071229,0.0,0.0
2522,2024-12-24 00:00:00+00:00,502.260351,505.258920,501.760590,501.760590,436491,0.0,0.0
2523,2024-12-27 00:00:00+00:00,500.261324,503.759655,500.261324,502.760132,881757,0.0,0.0
2524,2024-12-30 00:00:00+00:00,501.760601,504.759170,498.961948,500.761078,608408,0.0,0.0


Convert all feature names to lowercase

In [9]:
df.columns = df.columns.str.lower()
df.head(3)

Unnamed: 0,date,open,high,low,close,volume,dividends,stock splits
0,2015-01-02 00:00:00+00:00,629.093287,630.999626,621.269665,624.899292,385199,0.0,0.0
1,2015-01-05 00:00:00+00:00,624.517926,628.330604,611.204044,612.317383,416491,0.0,0.0
2,2015-01-06 00:00:00+00:00,616.130071,616.130071,597.447876,597.447876,781658,0.0,0.0
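Lowercasing leaves a space in `stock splits`, so that column can only be reached with bracket syntax. As an optional variation, not something this notebook does, the names could also be converted to snake_case. The helper name `normalise_columns` is hypothetical:

```python
def normalise_columns(columns):
    """Lowercase each name and replace spaces with underscores."""
    return [name.strip().lower().replace(" ", "_") for name in columns]


# Column names as they appear in the raw CSV
raw_columns = ["Date", "Open", "High", "Low", "Close",
               "Volume", "Dividends", "Stock Splits"]
clean_columns = normalise_columns(raw_columns)
print(clean_columns)
```

The result could be assigned with `df.columns = normalise_columns(df.columns)`. Later cells that refer to `'stock splits'` would then need the new name `'stock_splits'`.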


Check for missing and duplicated data

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2526 entries, 0 to 2525
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   date          2526 non-null   object 
 1   open          2526 non-null   float64
 2   high          2526 non-null   float64
 3   low           2526 non-null   float64
 4   close         2526 non-null   float64
 5   volume        2526 non-null   int64  
 6   dividends     2526 non-null   float64
 7   stock splits  2526 non-null   float64
dtypes: float64(6), int64(1), object(1)
memory usage: 158.0+ KB
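The summary above shows `date` stored as `object`, meaning plain strings. Below is a sketch of converting it to a timezone-aware datetime. The second sample row is made up to illustrate a different UTC offset, which may occur for a London-listed ticker during summer time; that is an assumption, since only `+00:00` rows are visible in this notebook.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "date": [
            "2015-01-02 00:00:00+00:00",  # taken from the output above
            "2024-06-28 00:00:00+01:00",  # illustrative row, not real data
        ],
        "close": [624.899292, 500.0],  # second value is a placeholder
    }
)

# utc=True converts every timestamp to UTC, so mixed offsets end up in one dtype
sample["date"] = pd.to_datetime(sample["date"], utc=True)
print(sample.dtypes)
```

After conversion the column supports date arithmetic, sorting and resampling, which string values do not.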


In [11]:
df.isna().sum()

date            0
open            0
high            0
low             0
close           0
volume          0
dividends       0
stock splits    0
dtype: int64

In [12]:
df[df.duplicated(subset=['date'])]

Unnamed: 0,date,open,high,low,close,volume,dividends,stock splits


Check for unique values

In [13]:
df['dividends'].unique()

array([ 0.    , 20.4201, 22.5713, 22.6   , 23.4   , 26.65  ])

In [14]:
df['stock splits'].unique()

array([0.])
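The unique values suggest two follow-up checks: which rows carry a dividend, and which columns never change (here `stock splits` is always 0). A minimal sketch on a made-up three-row frame; the dates are placeholders and only the dividend amount 20.4201 comes from the output above:

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "date": ["day 1", "day 2", "day 3"],  # placeholder labels, not real dates
        "dividends": [0.0, 20.4201, 0.0],
        "stock splits": [0.0, 0.0, 0.0],
    }
)

# Rows on which a dividend was paid
payouts = sample.loc[sample["dividends"] > 0, ["date", "dividends"]]

# Columns holding a single value carry no information for modelling
constant_columns = [col for col in sample.columns if sample[col].nunique() == 1]

print(payouts)
print(constant_columns)
```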

---

NOTE

* You may add as many sections as you want, as long as it supports your project workflow.
* All notebook cells should run top-down. Avoid a workflow that requires going back to an earlier cell to rerun a task or refresh a variable.

---

# Push files to Repo

* In case you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

Create the output directory and save the modified CSV file

In [15]:
dir = "outputs/datasets/collection"
try:
  # create the folder, including any missing parent folders
  os.makedirs(name=dir)
  print(f"directory {dir} created")
except Exception as e:
  print(e)

directory outputs/datasets/collection created


In [16]:
df.to_csv(f"{current_dir}/{dir}/phnx_2015_2025.csv")
print("csv file saved")

csv file saved
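Because `df` has a default integer index, the `to_csv` call above also writes that index, and it comes back as an `Unnamed: 0` column when the file is read again. The sketch below shows the difference with `index=False`, using an in-memory buffer instead of a real file. Whether later notebooks expect the extra column is not known here, so treat this as an option rather than a required change.

```python
import io

import pandas as pd

frame = pd.DataFrame({"date": ["2015-01-02"], "close": [624.899292]})

# Default behaviour: the integer index is written as the first, unnamed column
with_index = io.StringIO()
frame.to_csv(with_index)
reloaded_with_index = pd.read_csv(io.StringIO(with_index.getvalue()))

# index=False: only the data columns are written
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
reloaded_without_index = pd.read_csv(io.StringIO(without_index.getvalue()))

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```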
