# 3. Pre-Processing <a id="data_wrangling"></a>

<a id="contents"></a>
# Table of Contents  
3.1. [Introduction](#introduction) <br>
3.2. [Imports](#imports)  <br>
3.3. [Data Pre-processing](#process)<br>
3.4. [Scale the Data](#data)<br>
3.5. [Data Splitting](#split)<br>
3.6. [Save Updated Dataframe](#save)

## 3.1 Introduction<a id="introduction"></a>

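The goal of this notebook is to prepare the dataset produced during EDA for the modeling step of my project: define how missing values, categorical features, and numerical features are handled, split the data into training and test sets, and save the result.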

## 3.2 Imports<a id="imports"></a>

In [1]:
import numpy as np 
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

In [2]:
df = pd.read_csv('/Users/heatheradler/Documents/GitHub/Springboard/Springboard_Projects/Capstone 3/df_eda.csv')

In [3]:
# Convert the 'Date' column to datetime format
df['Date'] = pd.to_datetime(df['Date'])
# Reset the index and drop the leftover index column written by the previous to_csv call
df.reset_index(drop=True, inplace=True)
df.drop(columns=['Unnamed: 0'], inplace=True)
df

,SourceDataset,RegionID,SizeRank,RegionName,StateName,Date,Value,Quarter
0,ZORI,394913,1,"New York, NY",NY,2015-01-31,2367.192976,2015Q1
1,ZORI,394913,1,"New York, NY",NY,2015-02-28,2382.571737,2015Q1
2,ZORI,394913,1,"New York, NY",NY,2015-03-31,2401.539081,2015Q1
3,ZORI,394425,50,"Buffalo, NY",NY,2015-01-31,805.691732,2015Q1
4,ZORI,394425,50,"Buffalo, NY",NY,2015-02-28,819.385346,2015Q1
...,...,...,...,...,...,...,...,...
9877,ZORDI,394326,607,"Amsterdam, NY",NY,2024-05-31,138.000000,2024Q2
9878,ZORDI,394504,629,"Cortland, NY",NY,2024-04-30,34.000000,2024Q2
9879,ZORDI,394504,629,"Cortland, NY",NY,2024-05-31,31.000000,2024Q2
9880,ZORDI,395084,784,"Seneca Falls, NY",NY,2024-04-30,65.000000,2024Q2


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9882 entries, 0 to 9881
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   SourceDataset  9882 non-null   object        
 1   RegionID       9882 non-null   int64         
 2   SizeRank       9882 non-null   int64         
 3   RegionName     9882 non-null   object        
 4   StateName      9882 non-null   object        
 5   Date           9882 non-null   datetime64[ns]
 6   Value          9882 non-null   float64       
 7   Quarter        9882 non-null   object        
dtypes: datetime64[ns](1), float64(1), int64(2), object(4)
memory usage: 617.8+ KB


## 3.3 Data Pre-processing<a id="process"></a>

In [5]:
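# Define categorical and numerical feature columns.
# 'Value' is the target, so it is left out of the features, and only columns
# that exist in df (see df.info() above) are listed.
categorical_cols = ['SourceDataset', 'StateName', 'Quarter']
numerical_cols = ['RegionID', 'SizeRank']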

# Handling missing values
# For numerical features, replace missing values with the median
# For categorical features, replace missing values with the most frequent category
numeric_imputer = SimpleImputer(strategy='median')
categorical_imputer = SimpleImputer(strategy='most_frequent')

# Encoding categorical data
encoder = OneHotEncoder(handle_unknown='ignore')
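
To make the choices above concrete, here is a minimal sketch of what the two imputation strategies and `handle_unknown='ignore'` do. The toy arrays and the `OTHER` label are hypothetical and are not taken from `df_eda.csv`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data (one numeric and one categorical column, each with a gap)
toy_num = np.array([[1.0], [50.0], [np.nan], [607.0]])
toy_cat = np.array([["ZORI"], ["ZORI"], [np.nan], ["ZORDI"]], dtype=object)

# Median imputation fills the numeric gap with the median of the observed values
num_filled = SimpleImputer(strategy="median").fit_transform(toy_num)

# Most-frequent imputation fills the categorical gap with the most common label
cat_filled = SimpleImputer(strategy="most_frequent").fit_transform(toy_cat)

# With handle_unknown='ignore', a category not seen during fit is encoded
# as a row of zeros instead of raising an error
toy_encoder = OneHotEncoder(handle_unknown="ignore").fit(cat_filled)
unseen = toy_encoder.transform(np.array([["OTHER"]], dtype=object)).toarray()

print(num_filled.ravel())
print(cat_filled.ravel())
print(unseen)
```

Since `df.info()` reports no missing values in this dataset, the imputers act as a safeguard here rather than changing any values.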

## 3.4 Scale the Data<a id="data"></a>

In [6]:
# Scaling numerical data
scaler = StandardScaler()

# Create a preprocessing pipeline for numeric data
numeric_pipeline = Pipeline(steps=[
    ('imputer', numeric_imputer),
    ('scaler', scaler)
])

# Create a preprocessing pipeline for categorical data
categorical_pipeline = Pipeline(steps=[
    ('imputer', categorical_imputer),
    ('encoder', encoder)
])

# Combine preprocessing steps into a single transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_pipeline, numerical_cols),
        ('cat', categorical_pipeline, categorical_cols)
    ])

## 3.5 Data Splitting<a id="split"></a>

In [7]:
X = df.drop('Value', axis=1)  # Features
y = df['Value']  # Target variable

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

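# Scale the numerical features and one-hot encode the categorical features.
# This replaces the preprocessor defined in 3.4; it has no imputation step,
# which does not change the result because df.info() reports no missing values.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['RegionID', 'SizeRank']),  # numerical features
        # handle_unknown='ignore' keeps transform() from failing on categories absent from the training split
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['SourceDataset', 'StateName', 'Quarter'])  # categorical features
    ])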

# Fit the preprocessor on the training data and transform both training and test data
X_train_preprocessed = preprocessor.fit_transform(X_train)
X_test_preprocessed = preprocessor.transform(X_test)

# Print shapes to verify
print("Processed training features shape:", X_train_preprocessed.shape)
print("Processed test features shape:", X_test_preprocessed.shape)

Processed training features shape: (7905, 104)
Processed test features shape: (1977, 104)
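
The row counts in the shapes above follow from the split arithmetic: with `test_size=0.2`, `train_test_split` rounds the test size up, so 9882 × 0.2 = 1976.4 becomes 1977 test rows and the remaining 7905 rows go to training. A small sketch that reproduces the counts on a plain index array (the `n_rows` and `idx_*` names are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 9882  # row count reported by df.info() above

# Split a plain index array with the same settings as the notebook cell
idx_train, idx_test = train_test_split(
    np.arange(n_rows), test_size=0.2, random_state=42
)

print(len(idx_train), len(idx_test))

# Every row lands in exactly one of the two sets
assert len(set(idx_train) & set(idx_test)) == 0
```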


## 3.6 Save Updated Dataframe<a id="save"></a>

In [8]:
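# Save the cleaned dataframe (the scaled and encoded arrays above are not written to disk).
# The index is written as well and will show up as 'Unnamed: 0' when the file is read back.
df.to_csv('/Users/heatheradler/Documents/GitHub/Springboard/Springboard_Projects/Capstone 3/df_pp.csv')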
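
A CSV round trip does not preserve the index or the datetime dtype by default, which is why this notebook had to drop `Unnamed: 0` and re-parse `Date` after loading. The sketch below shows the effect on a hypothetical two-row frame using in-memory buffers. Passing `index=False` and `parse_dates` is one possible way to avoid the extra cleanup, not what the notebook currently does.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for df
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2015-01-31", "2015-02-28"]),
    "Value": [2367.19, 2382.57],
})

# Default behaviour: the index is written and comes back as 'Unnamed: 0'
buf = io.StringIO()
toy.to_csv(buf)
buf.seek(0)
reloaded = pd.read_csv(buf)

# Alternative: skip the index on write and parse the dates on read
buf_clean = io.StringIO()
toy.to_csv(buf_clean, index=False)
buf_clean.seek(0)
reloaded_clean = pd.read_csv(buf_clean, parse_dates=["Date"])

print(list(reloaded.columns))
print(reloaded_clean.dtypes)
```

Note that switching to `index=False` would also require removing the `df.drop(columns=['Unnamed: 0'])` step in any notebook that reads the file afterwards.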