
Steps
1. Import only the libraries you need
2. Load the dataset
3. Make a copy of the dataframe (in case you make a mistake)
4. Check `.info()` for datatypes and missing values
5. Check for duplicates - `df.duplicated().sum()`

  a. Drop duplicates if necessary - `df.drop_duplicates(inplace=True)`
6. Check for impossible outliers in numeric data - `df.describe(include='number')`

  a. Drop or fix impossible outliers if necessary

7. Check for inconsistencies in categorical data - `df[col].value_counts()` for each categorical column

  a. Change category names if necessary - `df.replace(replacement_dict, inplace=True)`
8. Split the dataset - (X, y) and (train, test)
9. Preprocess the data
10. Explain and justify the strategy used in the code. For example, why impute with the mean, the median, or the mode?
11. Instantiate transformers and selectors
12. Make pipelines (if needed) - `make_pipeline(trans1, trans2)`
13. Make tuples - `(name, pipeline, columns)`
14. Put the tuples into a column transformer - `ColumnTransformer([tuple1, tuple2])`
15. Fit the column transformer on **train only**, then transform train AND test.
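The inspection steps (5 through 7) can be sketched on a small made-up dataframe. The column names and values below are illustrative assumptions, not the diamonds data; the point is only to show what each check reveals.

```python
import pandas as pd

# Toy data (made up for illustration): one duplicate row, one inconsistent
# category label ('ideal'), and one impossible value (a negative length).
toy = pd.DataFrame({
    "cut": ["Ideal", "Good", "Ideal", "ideal", "Fair"],
    "x": [4.0, 3.9, 4.0, 4.1, -1.0],
})

# Step 5: count duplicate rows, then drop them
n_duplicates = toy.duplicated().sum()
toy = toy.drop_duplicates()

# Step 6: the summary statistics expose the impossible minimum
min_x = toy["x"].describe()["min"]

# Step 6a: keep only rows with a physically possible length
toy = toy[toy["x"] > 0].copy()

# Step 7: value_counts shows 'Ideal' and 'ideal' as separate categories
counts_before = toy["cut"].value_counts()

# Step 7a: map the inconsistent label onto the correct one
toy["cut"] = toy["cut"].replace({"ideal": "Ideal"})
counts_after = toy["cut"].value_counts()

print(n_duplicates, min_x)
print(counts_after)
```

Whether an impossible value should be dropped or repaired depends on the dataset; dropping is used here only to keep the sketch short.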


In [None]:
# Imports
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder

from sklearn.pipeline import make_pipeline

from sklearn import set_config
set_config(display='diagram')

# Task:
Predict the sale price of a diamond based on measurements of the diamond

carat
weight of the diamond

cut
quality of the cut (Fair, Good, Very Good, Premium, Ideal)

color
diamond colour, from J (worst) to D (best)

clarity
a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))

x
length in mm

y
width in mm

z
depth in mm

table
width of top of diamond relative to widest point

# Steps:

1. Load the Data
2. Check for duplicates, missing values, errors
3. Split the data (X, y and train, test)
4. Instantiate the Transformers and column selectors
5. Create numeric preprocessing pipeline and categorical preprocessing pipeline
6. Combine pipelines using column transformer
7. Fit preprocessor on X_train
8. Transform both X_train and X_test
9. (Optional) convert processed X_train back to a dataframe.

# Load the Data

In [None]:
# Load in the data
df = pd.read_csv('https://docs.google.com/spreadsheets/d/e/2PACX-1vTpBcQrzjkPeQ4W_sZuOa45qqxLBgtr_wzistIB3a7gFUJ1AlBhloV5g9IxV5DFnd-GPktXDNprGOUh/pub?gid=88799590&single=true&output=csv')

In [None]:
# Display the first five rows of the dataframe
df.head()

We will work on a copy of the original df so the raw data stays unchanged.

In [None]:
# Make a copy of the original df so the raw data stays unchanged
eda_ml = df.copy()

In [None]:
# Drop unnecessary columns
eda_ml = eda_ml.drop('Unnamed: 0', axis=1)

In [None]:
# Display the first five rows of the dataframe
eda_ml.head()

# Check for Duplicated, Missing, or Erroneous Data

In [None]:
# Check to see if there are any duplicate rows
eda_ml.duplicated().sum()

In [None]:
# Drop duplicate rows
eda_ml.drop_duplicates(inplace=True)

In [None]:
# Confirm no duplicate rows remain
eda_ml.duplicated().sum()

In [None]:
# Display summary info
eda_ml.info()

In [None]:
# Display the total number of missing values in the dataframe
eda_ml.isna().sum().sum()

In [None]:
# Display the number of missing values in each column
eda_ml.isna().sum()

In [None]:
# Display descriptive statistics for the numeric columns
eda_ml.describe(include='number')

In [None]:
# Display descriptive statistics for the object (categorical) columns
eda_ml.describe(include='object')

# Missing Values: Drop Rows or Impute?
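When rows are imputed rather than dropped, the choice of strategy matters for skewed columns. The following minimal sketch uses made-up numbers (not the diamonds data) to compare the value that mean and median imputation would fill in when one large outlier is present.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up, right-skewed values with one missing entry and one large outlier
values = np.array([[1.0], [2.0], [3.0], [np.nan], [100.0]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(values)
median_filled = SimpleImputer(strategy="median").fit_transform(values)

# Row 3 held the missing entry; compare what each strategy put there
mean_fill_value = mean_filled[3, 0]      # mean of 1, 2, 3, 100
median_fill_value = median_filled[3, 0]  # median of 1, 2, 3, 100

print(mean_fill_value, median_fill_value)
```

The outlier pulls the mean well above the typical values, while the median stays close to them, which is one common argument for median imputation on skewed numeric data.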

In [None]:
# Check target for null values
eda_ml['price'].isna().sum()

In [None]:
# Drop rows without a Target Value
eda_ml.dropna(subset=['price'], inplace=True)

How many rows would we lose if we just dropped the rows with missing values?

In [None]:
# Percent of the remaining rows with at least one missing value
percent_missing = (1 - eda_ml.dropna().shape[0] / eda_ml.shape[0]) * 100
print(f'{percent_missing:.4f} percent of rows are missing at least 1 value')

# Ordinal Encode 'cut' and 'clarity'

The order of the `cut` categories is straightforward, but ordering the `clarity` categories took some research; see [Selecting a Diamond](https://selectingadiamond.com/diamond-clarity/).

![Diamond Clarity Chart](https://selectingadiamond.com/wp-content/uploads/2019/10/Diamond-Clarity.jpg)
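To make the ordering concrete, here is a minimal sketch of how `OrdinalEncoder` maps categories to integers when it is given explicit ordered lists. The orders come from the data dictionary above; the three toy rows are made up.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Orders taken from the data dictionary above, worst to best
cut_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
clarity_order = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]

# Toy rows (made up for illustration)
toy = pd.DataFrame({
    "cut": ["Ideal", "Fair", "Premium"],
    "clarity": ["IF", "I1", "VS2"],
})

# One ordered list per column, in the same order as the columns
encoder = OrdinalEncoder(categories=[cut_order, clarity_order])
encoded = encoder.fit_transform(toy)

# Each value becomes its position in the ordered list (0 = worst)
print(encoded)
```

Without the `categories` argument, the encoder would sort the labels alphabetically, which would not reflect quality.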

In [None]:
# Check ordinal categories
eda_ml['cut'].value_counts()


In [None]:
eda_ml['clarity'].value_counts()

In [None]:
eda_ml['cut'].unique()

# Split the Data (Validation Split)

In [None]:
# split X and y, we are predicting price
target = 'price'
X = eda_ml.drop(columns=[target]).copy()
y = eda_ml[target].copy()

# Split into training and test sets
# Set random_state to 42 for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

In [None]:
X_train.shape

In [None]:
X_test.shape

In [None]:
X_train.dtypes

# Create the Pipelines and Tuples for Each Group of Columns

I'm going to divide my data into

- numeric
- nominal categorical, and
- ordinal categorical columns

and preprocess each subset differently.


## 1. Numeric

In [None]:
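# PREPROCESSING PIPELINE FOR NUMERIC DATA

# Save list of numeric column names
num_cols = X_train.select_dtypes('number').columns.tolist()

# Transformers
# Median imputation is an assumption here: it is less sensitive to outliers than the mean
impute_median = SimpleImputer(strategy='median')
num_scaler = StandardScaler()

# Pipeline
num_pipe = make_pipeline(impute_median, num_scaler)

In [None]:
# Tuple: (name, pipeline, columns)
numeric_tuple = ('numeric', num_pipe, num_cols)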

## 2. Ordinal

In [None]:
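# PREPROCESSING PIPELINE FOR ORDINAL DATA

# Save list of ordinal column names
ord_cols = ['cut', 'clarity']

# Ordered Category Lists (worst to best, taken from the data dictionary above)
cut_order = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
clarity_order = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']

# Transformers
# Most-frequent imputation is an assumption; it keeps every value inside the known categories
impute_freq = SimpleImputer(strategy='most_frequent')
ord_encoder = OrdinalEncoder(categories=[cut_order, clarity_order])
# Ordinal codes can span a wide range when a column has many categories,
# so scale them like the other numeric features
ord_scaler = StandardScaler()

# Pipeline
ord_pipe = make_pipeline(impute_freq, ord_encoder, ord_scaler)

# Tuple: (name, pipeline, columns)
ord_tuple = ('ordinal', ord_pipe, ord_cols)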


## 3. Nominal

In [None]:
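# PREPROCESSING PIPELINE FOR ONE-HOT-ENCODED DATA

# Save list of nominal column names (object columns that are not ordinal)
nominal_cols = X_train.select_dtypes('object').columns.drop(['cut', 'clarity']).tolist()

# Transformers
# Filling with a constant 'MISSING' label is an assumption; most_frequent is another option
impute_missing = SimpleImputer(strategy='constant', fill_value='MISSING')
ohe_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

# Pipeline
ohe_pipe = make_pipeline(impute_missing, ohe_encoder)

# Tuple: (name, pipeline, columns)
ohe_tuple = ('categorical', ohe_pipe, nominal_cols)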

In [None]:
nominal_cols

# Create Column Transformer to Apply Different Preprocessing to Different Columns

In [None]:
# Instantiate the column transformer
col_transformer = ColumnTransformer([numeric_tuple,
                                     ord_tuple,
                                     ohe_tuple],
                                    remainder='drop', verbose_feature_names_out=False)


# Fit the Column Transformer on the Training Data Only
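Fitting on the training data only means that every statistic used for preprocessing (means, standard deviations, category lists) is learned without looking at the test rows. A minimal sketch with a single `StandardScaler` and made-up numbers illustrates the idea:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: three training rows and one unusually large test row
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[10.0]])

scaler = StandardScaler()
scaler.fit(train)  # the mean and standard deviation come from train only

# Both sets are transformed with the statistics learned from train
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

print(scaler.mean_)   # the training mean
print(test_scaled)    # the test row expressed in training-set units
```

Fitting on the test set as well would leak information about it into the preprocessing and make the evaluation overly optimistic.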

In [None]:
# Fit the column transformer on the X_train
col_transformer.fit(X_train)

# Transform Both Training and Testing Data

In [None]:
# Set the default transformation output to Pandas
from sklearn import set_config
set_config(transform_output='pandas')

In [None]:
# Transform the X_train and the X_test
X_train_proc = col_transformer.transform(X_train)
X_test_proc = col_transformer.transform(X_test)

# Check the Result

In [None]:
# Display the first (5) rows of the dataframe
display(X_train_proc.head())
# Check the shape
print(f'\nshape of processed data is: {X_train_proc.shape}')
# Check for remaining missing values
print(f'\nThere are {X_train_proc.isna().sum().sum()} missing values')
# Check the data types
print(f'\nThe datatypes are {X_train_proc.dtypes}')

# (Bonus!) Preview of Next Week!  Modeling!
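Before fitting the real model, here is a minimal sketch on made-up numbers (not the diamonds data) of the two ideas used in this section: a pipeline fits its transformers and its model in a single `fit` call, and mean absolute error is the average absolute gap between targets and predictions. A plain `StandardScaler` stands in for the column transformer here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data that follows y = 2x exactly
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

# One fit call fits the scaler first, then the regression on the scaled data
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X, y)
preds = pipe.predict(X)

# MAE from scikit-learn and the same quantity computed by hand
mae = mean_absolute_error(y, preds)
manual_mae = np.mean(np.abs(y - preds))

print(mae, manual_mae)
```

Because the toy target is perfectly linear, the error is essentially zero; on real data the gap between training and testing error is the number to watch.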

In [None]:
# Import model and metrics
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

In [None]:
# Instantiate model
lr = LinearRegression()
# create model pipeline
lr_pipe = make_pipeline(col_transformer, lr)
# fit model pipeline
# this fits ALL transformers AND the model
lr_pipe.fit(X_train, y_train)
# make predictions with BOTH the training and testing set

train_predictions = lr_pipe.predict(X_train)
test_predictions = lr_pipe.predict(X_test)

In [None]:
train_predictions[:5]

In [None]:
y_train.head()

In [None]:
# Evaluate the average error of the model for train and test sets
train_error = mean_absolute_error(y_train, train_predictions)
test_error = mean_absolute_error(y_test, test_predictions)

print(f"The model's average error on the training set is {train_error}")
print(f"The model's average error on the testing set is {test_error}")