# **Data Preprocessing**

## Objectives

* Import Libraries: Load the libraries needed for data manipulation and visualization.
* Load Data: Load the dataset and display the first records.
* Data Exploration: Check the data for missing values and visualize them.
* Handle Missing Values: Drop or fill missing values.
* Data Transformation: Convert categorical variables to numeric format.
* Feature Selection: Select features based on their correlation with the target variable.
* Split Data: Split the data into training and testing sets and save the processed data.

## Inputs

* The raw dataset, `data/dataset.csv`

## Outputs

* The train and test sets saved in the `data` folder: `X_train.csv`, `X_test.csv`, `y_train.csv` and `y_test.csv`

## Additional Comments

* In case you have any additional comments that don't fit in the previous bullets, please state them here. 


---

# Change working directory

* The notebooks are assumed to be stored in a subfolder, so the working directory must be changed when running the notebook in the editor.

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
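Re-running the `os.chdir` cell above moves one level further up each time it executes. A minimal sketch of a guard against that, assuming the notebooks sit in a subfolder with the hypothetical name `jupyter_notebooks` (the folder name and the helper `move_to_project_root` are illustrative, not part of the original notebook):

```python
import os

def move_to_project_root(notebook_dir_name="jupyter_notebooks"):
    """Move up one level only when the current folder is the notebooks folder.

    notebook_dir_name is an assumed folder name; adjust it to the real one.
    """
    current = os.getcwd()
    if os.path.basename(current) == notebook_dir_name:
        os.chdir(os.path.dirname(current))
    # Return the directory that is current after the (possible) change
    return os.getcwd()
```

Calling the function a second time leaves the working directory unchanged, because the current folder no longer matches the notebooks folder name.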

# Import Libraries and Load Data:

1. Import required Libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns

2. Load the Dataset

In [None]:
# Load the dataset (the path is relative to the project root set as the working directory above)
df = pd.read_csv('data/dataset.csv')

# Display the first few rows of the dataset
df.head()

---

# Data Exploration:

Basic Statistics

In [None]:
df.describe()

Check for Missing Values

In [None]:
df.isnull().sum()
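Raw counts are easier to judge next to the share of rows they represent. The sketch below adds a percentage column to the missing-value count; the helper name `missing_summary` and the toy data are assumptions for illustration, not the notebook's dataset.

```python
import numpy as np
import pandas as pd

def missing_summary(frame):
    """Return the count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    percent = (counts / len(frame) * 100).round(2)
    summary = pd.DataFrame({"missing_count": counts, "missing_percent": percent})
    return summary.sort_values("missing_count", ascending=False)

# Toy data for illustration only
toy = pd.DataFrame({
    "age": [25, np.nan, 40, np.nan],
    "city": ["A", "B", None, "D"],
    "score": [1.0, 2.0, 3.0, 4.0],
})
print(missing_summary(toy))
```

In the notebook this would be called as `missing_summary(df)`. Columns with a high missing percentage are candidates for dropping, while columns with a low one may be worth filling.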

Visualize Missing Data

In [None]:
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()

---

# Handle Missing Values:

Drop Missing Values

In [None]:
df_cleaned = df.dropna()

Fill Missing Values

In [None]:
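# Fill missing values (if dropping is not feasible)
# Example: fill numeric columns with their column mean
# numeric_only=True skips non-numeric columns, which raise a TypeError in recent pandas versions
df_filled = df.fillna(df.mean(numeric_only=True))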
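Filling with the mean only covers numeric columns and is sensitive to outliers. One possible alternative, sketched here under the assumption that the median suits numeric columns and the most frequent value suits the rest, is shown below. The helper `fill_missing` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def fill_missing(frame):
    """Fill numeric columns with the median and all other columns with the mode."""
    filled = frame.copy()  # leave the input frame untouched
    for column in filled.columns:
        if filled[column].isnull().sum() == 0:
            continue
        if pd.api.types.is_numeric_dtype(filled[column]):
            filled[column] = filled[column].fillna(filled[column].median())
        else:
            modes = filled[column].mode(dropna=True)
            if not modes.empty:  # an all-missing column has no mode; leave it as is
                filled[column] = filled[column].fillna(modes.iloc[0])
    return filled

# Toy data for illustration only
toy = pd.DataFrame({
    "income": [10.0, np.nan, 30.0, 1000.0],
    "segment": ["a", "b", None, "b"],
})
toy_filled = fill_missing(toy)
print(toy_filled)
```

Here the missing `income` is filled with the median of the observed values rather than a mean pulled upward by the outlier, and the missing `segment` takes the most frequent category.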

---

# Data Transformation:

Convert Categorical Variables to Numeric

In [None]:
# Convert categorical variables to numeric using one-hot encoding
df_transformed = pd.get_dummies(df_cleaned, drop_first=True)
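To see what `drop_first=True` does, it can help to run the encoding on a tiny frame. The toy data below is made up for illustration, and `dtype=int` is added only so the indicator columns print as 0/1 rather than booleans.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "colour": ["red", "green", "blue", "green"],
    "size": [1, 2, 3, 4],
})

# One indicator column per category, minus the first category of each variable
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(encoded)
```

The category `blue` comes first alphabetically, so it gets no column of its own and is represented by zeros in both `colour_green` and `colour_red`. Dropping one level this way avoids perfectly collinear indicator columns.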

---

# Feature Selection:

Correlation Matrix

In [None]:
# Plot correlation matrix
plt.figure(figsize=(12, 8))
sns.heatmap(df_transformed.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

Select features based on correlation

In [None]:
# Select features whose absolute correlation with the target exceeds 0.1,
# so that strongly negative correlations are kept as well
corr_matrix = df_transformed.corr()
target_corr = corr_matrix['target_column'].sort_values(ascending=False)
selected_features = target_corr[target_corr.abs() > 0.1].index.tolist()
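The selection step can be checked on a small frame where the correlations are known in advance. This is a minimal sketch: the helper `select_by_correlation`, the column names and the data are hypothetical. It filters on the absolute correlation and excludes the target itself from the result.

```python
import pandas as pd

def select_by_correlation(frame, target, threshold=0.1):
    """Return feature names whose absolute correlation with target exceeds threshold."""
    target_corr = frame.corr()[target].drop(target)  # drop the target's self-correlation
    return target_corr[target_corr.abs() > threshold].index.tolist()

# Toy data for illustration only
toy = pd.DataFrame({
    "up": [1, 2, 3, 4, 5, 6],        # rises with the target
    "down": [6, 5, 4, 3, 2, 1],      # falls as the target rises
    "noise": [1, 0, 0, 0, 0, 1],     # uncorrelated with the target
    "target": [2, 4, 6, 8, 10, 12],
})
selected = select_by_correlation(toy, "target")
print(selected)
```

Both `up` and `down` are kept even though their correlations have opposite signs, while `noise` is filtered out. The threshold of 0.1 is the one used in the notebook and may need tuning for a real dataset.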

---

# Split the data into training and testing sets:

Split Data

In [None]:
# Define features (X) and target (y)
# selected_features includes 'target_column' itself (self-correlation of 1.0), so drop it from X
X = df_transformed[selected_features].drop('target_column', axis=1)
y = df_transformed['target_column']

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Save the preprocessed data (paths are relative to the project root set as the working directory)
X_train.to_csv('data/X_train.csv', index=False)
X_test.to_csv('data/X_test.csv', index=False)
y_train.to_csv('data/y_train.csv', index=False)
y_test.to_csv('data/y_test.csv', index=False)
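If the target is categorical, a plain random split can leave the class proportions uneven between the two sets. The sketch below shows the `stratify` argument of `train_test_split` on toy data; whether stratification applies to this notebook depends on the actual target, which is an assumption here.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data for illustration only: ten rows, two balanced classes
toy = pd.DataFrame({"feature": range(10), "target": [0, 1] * 5})
X = toy.drop(columns="target")
y = toy["target"]

# stratify=y keeps the class proportions the same in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_test.value_counts())
```

With `test_size=0.2`, two of the ten rows go to the test set, one from each class. For a continuous target, omit `stratify`.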

---

# Summary and save the notebook:

## Data Preprocessing Summary

- Loaded dataset from `data/dataset.csv`.
- Explored the dataset and visualized missing values.
- Handled missing values by dropping rows with missing data.
- Converted categorical variables to numeric using one-hot encoding.
- Selected features based on correlation with the target variable.
- Split the data into training and test sets.
- Saved the preprocessed data for model training.

---

# Push files to Repo

* In case you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create here your folder
  # os.makedirs(name='')
  pass  # keeps the try block valid Python while the lines above stay commented out
except Exception as e:
  print(e)
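One way to fill in the placeholder above is `os.makedirs` with `exist_ok=True`, which does not raise an error when the folder already exists. This is a sketch only: the helper `ensure_folder` and the folder name in the usage note are hypothetical.

```python
import os

def ensure_folder(path):
    """Create the folder (and any missing parents) if it does not already exist."""
    os.makedirs(path, exist_ok=True)
    # Report whether the folder is now present
    return os.path.isdir(path)
```

For example, `ensure_folder('outputs/datasets')` could be called before saving files. Running it again is harmless because the existing folder is left untouched.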
