# **Data Cleaning**

## Objectives

* Perform data cleaning and preprocessing

## Inputs

* outputs/datasets/collection/phnx_2015_2025.csv

## Outputs

* New exploratory features added
* Target variable column defined for classification and regression
* Cleaned dataset with consistent formatting and no missing values
* Train and Test sets, both saved under outputs/datasets/cleaned

## Additional Comments

* In case you have any additional comments that don't fit in the previous bullets, please state them here. 



---

# Change working directory

* The notebooks are assumed to be stored in a subfolder, so when running a notebook in the editor you need to change the working directory.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
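
The directory change above moves up one level every time its cell runs, so running it twice leaves the notebook one folder too high. A minimal sketch of a guarded version follows; the helper name change_to_parent_once and the folder name jupyter_notebooks are assumptions for illustration, not names taken from this project.

```python
import os


def change_to_parent_once(notebook_folder="jupyter_notebooks"):
    """Move to the parent directory only when still inside the notebook folder.

    notebook_folder is a hypothetical name; replace it with your own subfolder.
    Returns the working directory after the call.
    """
    current_dir = os.getcwd()
    if os.path.basename(current_dir) == notebook_folder:
        os.chdir(os.path.dirname(current_dir))
    return os.getcwd()
```

Calling the helper a second time is harmless, because the folder name no longer matches once the working directory is the project root.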

---

## Load Data

In [None]:
import pandas as pd

df = pd.read_csv("outputs/datasets/collection/phnx_2015_2025.csv")
df.head(3)

# Section 1

In [None]:
df = df.copy()
print(df.shape)
df

## Data Exploration

In [None]:
vars_with_missing_data = df.columns[df.isna().sum() > 0].to_list()
vars_with_missing_data

In [None]:
from ydata_profiling import ProfileReport
if vars_with_missing_data:
    profile = ProfileReport(df=df[vars_with_missing_data], minimal=True)
    profile.to_notebook_iframe()
else:
    print("There are no variables with missing data")

---

# Section 2

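This section drops the unused dividends and stock splits columns, adds lagged price and volume features from the previous one and two rows, defines the binary classification target, and removes rows with missing values.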

In [None]:
df = df.drop(['dividends','stock splits'],axis=1)

In [None]:
# Lag-1 features: prices and volume from the previous row
df['pre_open'] = df['open'].shift(1)
df['pre_high'] = df['high'].shift(1)
df['pre_low'] = df['low'].shift(1)
df['pre_close'] = df['close'].shift(1)
df['pre_vol'] = df['volume'].shift(1)
# Lag-2 features: prices and volume from two rows back
df['pre_open_2'] = df['open'].shift(2)
df['pre_high_2'] = df['high'].shift(2)
df['pre_low_2'] = df['low'].shift(2)
df['pre_close_2'] = df['close'].shift(2)
df['pre_vol_2'] = df['volume'].shift(2)
print(df.shape)
df
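
The lag features rely on shift(n), which moves a column n rows down so that each row sees the value from n rows earlier. The toy example below, using made-up prices and assuming the rows are in chronological order, shows the effect and the NaN values it leaves at the top of the frame.

```python
import pandas as pd

# Toy closing prices, assumed to be in chronological order
toy = pd.DataFrame({"close": [10.0, 11.0, 12.0, 13.0]})

# shift(n) moves values n rows down, so each row sees the value from n rows earlier
for lag, suffix in [(1, ""), (2, "_2")]:
    toy[f"pre_close{suffix}"] = toy["close"].shift(lag)

print(toy)
```

The first row has no lag-1 value and the first two rows have no lag-2 value, which is why rows with missing values appear after this step.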

In [None]:
df['average'] = df[['open', 'close']].mean(axis=1)

In [None]:
df['tomorrows average'] = df[['open', 'close']].mean(axis=1).shift(-1)

In [None]:
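# The last row has no next-day average (NaN), and NaN > x evaluates to False,
# so drop that row before labelling to avoid a spurious 0 target.
df = df.dropna(subset=['tomorrows average']).copy()
df['target'] = (df['tomorrows average'] > df['average']).astype(int)
df = df.drop(['tomorrows average'], axis=1)
print(df.shape)
df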
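
The labelling rule can be checked on a toy frame: the target is 1 when the next row's open/close average is higher than the current row's. The sketch below uses made-up prices and also shows that a comparison against NaN evaluates to False, which is why the final row (which has no next day) needs attention. All names and values here are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({"open": [10.0, 12.0, 11.0], "close": [12.0, 14.0, 9.0]})
toy["average"] = toy[["open", "close"]].mean(axis=1)   # 11, 13, 10
next_average = toy["average"].shift(-1)                # 13, 10, NaN

# Naive comparison: the last row is compared against NaN and yields False
naive_last = bool((next_average > toy["average"]).iloc[-1])

# Keep only rows that have a next-day value, then label them
has_next = next_average.notna()
labelled = toy[has_next].copy()
labelled["target"] = (next_average[has_next] > labelled["average"]).astype(int)

print(labelled)
```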

In [None]:
# CalculateCorrAndPPS is not defined in this notebook; define or import it in an earlier cell
df_corr_pearson, df_corr_spearman, pps_matrix = CalculateCorrAndPPS(df)

In [None]:
# DisplayCorrAndPPS must also be defined or imported in an earlier cell
DisplayCorrAndPPS(df_corr_pearson=df_corr_pearson,
                  df_corr_spearman=df_corr_spearman,
                  pps_matrix=pps_matrix,
                  CorrThreshold=0.4, PPS_Threshold=0.2,
                  figsize=(12, 10), font_annot=10)
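
CalculateCorrAndPPS is not shown in this notebook, so its internals are unknown here. As a rough, assumed stand-in for its correlation part only, the sketch below computes Pearson and Spearman matrices with pandas. The helper name calculate_correlations is hypothetical, and the predictive power score is not covered.

```python
import pandas as pd


def calculate_correlations(df):
    """Return (pearson, spearman) correlation matrices for the numeric columns.

    Hypothetical helper for illustration; not the notebook's CalculateCorrAndPPS.
    """
    numeric = df.select_dtypes(include="number")
    return numeric.corr(method="pearson"), numeric.corr(method="spearman")


# Toy data: y grows monotonically but not linearly with x
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [1.0, 4.0, 9.0, 16.0]})
pearson, spearman = calculate_correlations(toy)
print(pearson.loc["x", "y"], spearman.loc["x", "y"])
```

Spearman works on ranks, so it reports a perfect monotonic relationship for the toy data, while Pearson stays slightly below 1 because the relationship is not linear.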

## Assessing Missing Data Levels

In [None]:
def EvaluateMissingData(df):
    missing_data_absolute = df.isnull().sum()
    missing_data_percentage = round(missing_data_absolute/len(df)*100, 2)
    df_missing_data = (pd.DataFrame(
                            data={"RowsWithMissingData": missing_data_absolute,
                                   "PercentageOfDataset": missing_data_percentage,
                                   "DataType": df.dtypes}
                                    )
                          .sort_values(by=['PercentageOfDataset'], ascending=False)
                          .query("PercentageOfDataset > 0")
                          )

    return df_missing_data

In [None]:

EvaluateMissingData(df)

In [None]:
df = df.dropna()
print(df.shape)
df

In [None]:
EvaluateMissingData(df)

In [None]:
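import matplotlib.pyplot as plt
import seaborn as sns

# Class balance of the target after dropping rows with missing values
print(df.shape)
plt.figure(figsize=(12, 5))
sns.countplot(data=df, x='target', hue='target', order=df['target'].value_counts().index)
plt.xticks(rotation=90)
plt.show()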

# Section 3

## Split Train and Test Set

In [None]:
from sklearn.model_selection import train_test_split

# Random (shuffled) 80/20 split; the target column stays inside both sets
TrainSet, TestSet = train_test_split(df, test_size=0.2, random_state=0)

print(f"TrainSet shape: {TrainSet.shape} \nTestSet shape: {TestSet.shape}")
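
The split above shuffles rows at random. Because the features are lagged values of a time series, a chronological split is a common alternative that keeps every test row later than every train row. The sketch below is one hedged way to do it with plain pandas; it assumes the rows are already in date order, and the helper name chronological_split is hypothetical.

```python
import pandas as pd


def chronological_split(df, test_size=0.2):
    """Split df into (train, test) without shuffling; the last rows form the test set."""
    n_test = int(round(len(df) * test_size))
    split_at = len(df) - n_test
    return df.iloc[:split_at], df.iloc[split_at:]


# Toy frame standing in for the cleaned dataset
toy = pd.DataFrame({"close": range(10), "target": [0, 1] * 5})
train, test = chronological_split(toy)
print(f"train rows: {len(train)}, test rows: {len(test)}")
```

Whether this suits the project depends on how the model will be evaluated; the notebook's shuffled split is kept as is.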

In [None]:
df_missing_data = EvaluateMissingData(TrainSet)
print(f"* There are {df_missing_data.shape[0]} variables with missing data \n")
df_missing_data

---

NOTE

* You may add as many sections as you want, as long as they support your project workflow.
* All notebook cells should run top-down: you can't create a dynamic in which, at a given point, you need to go back to a previous cell to execute some task, such as refreshing a variable's content.

---

# Push files to Repo

* In case you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
# Create the output folder; exist_ok=True avoids an error if it already exists
os.makedirs('outputs/datasets/cleaned', exist_ok=True)
print("folder ready")

## Cleaned Data

In [None]:
df.to_csv("outputs/datasets/cleaned/phnx_2015_2025.csv", index=False)
print("file saved")

## Train Set

In [None]:
TrainSet.to_csv("outputs/datasets/cleaned/TrainSetCleaned.csv", index=False)
print("file saved")

## Test Set

In [None]:
TestSet.to_csv("outputs/datasets/cleaned/TestSetCleaned.csv", index=False)
print("file saved")