### **Scaling in Machine Learning**

 * Scaling transforms numerical features to a common range or spread so that machine learning models can process them more effectively.
 * Scaling keeps features with large numeric ranges from dominating features with small ranges. It matters most for distance-based and gradient-based algorithms (for example k-nearest neighbours or models trained with gradient descent), where it improves convergence and numerical stability during optimization.

 * MinMax scaling transforms the features so that they lie within a specific range, typically [0, 1].
 * Standard scaling transforms the data so that each feature has a mean of 0 and a standard deviation of 1. It does not bound the values to a specific range, and it does not change the shape of the distribution: a skewed feature stays skewed after standardization.
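
As a minimal sketch of the two transformations described above, the formulas can be applied by hand to a small made-up array. The values below are assumptions chosen for illustration and are not taken from the dataset used later.

```python
import numpy as np

# Hypothetical toy feature with one large value (not from the dataset).
x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])

# Min-max scaling: x' = (x - min) / (max - min), which maps values into [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standard scaling: z = (x - mean) / std, using the population standard
# deviation (ddof=0), the convention scikit-learn's StandardScaler follows.
x_standard = (x - x.mean()) / x.std(ddof=0)

print(x_minmax)
print(x_standard.mean(), x_standard.std(ddof=0))
```

Both transformations are linear, so the ordering of the values and the relative gaps between them are preserved; only the location and spread change.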

### **1. Importing Required Libraries**

We start by importing essential libraries for data manipulation, visualization, and scaling.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import xlrd  # engine pandas uses to read legacy .xls files
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler


### **2. Loading the Dataset**

The dataset is loaded from an Excel file. Ensure the file path is correctly set for your environment.

In [2]:
df = pd.read_excel("E:\\Machine Learning\\global_superstore\\Global Superstore.xls")
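
Because the path above is specific to one machine, a small guard can make a wrong path fail with a clearer message. This is only a sketch: the helper name `load_superstore` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_superstore(path):
    """Read the Excel file at `path`, failing early if the file is missing."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(
            f"Dataset not found at {path}; adjust the path for your machine."
        )
    # Reading legacy .xls files relies on the xlrd package.
    return pd.read_excel(path)
```

Calling `load_superstore(...)` with the location of your own copy of the file would then return the same DataFrame as the cell above.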

### **3. Exploring the Dataset**

- **Displaying the First Few Rows**: This gives a glimpse of the first few rows of the dataset.
- **Listing All Columns**: To understand the structure of the dataset, we list all column names.
- **Dataset Shape**: Shows the number of rows and columns in the dataset.
- **Dataset Information**: Provides an overview of the dataset, including column data types and non-null counts.
- **Checking for Missing Values**: Summarizes the count of missing values in each column.

In [3]:
df.head()

| | Row ID | Order ID | Order Date | Ship Date | Ship Mode | Customer ID | Customer Name | Segment | City | State | ... | Product ID | Category | Sub-Category | Product Name | Sales | Quantity | Discount | Profit | Shipping Cost | Order Priority |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 32298 | CA-2012-124891 | 2012-07-31 | 2012-07-31 | Same Day | RH-19495 | Rick Hansen | Consumer | New York City | New York | ... | TEC-AC-10003033 | Technology | Accessories | Plantronics CS510 - Over-the-Head monaural Wir... | 2309.65 | 7 | 0.0 | 762.1845 | 933.57 | Critical |
| 1 | 26341 | IN-2013-77878 | 2013-02-05 | 2013-02-07 | Second Class | JR-16210 | Justin Ritter | Corporate | Wollongong | New South Wales | ... | FUR-CH-10003950 | Furniture | Chairs | Novimex Executive Leather Armchair, Black | 3709.395 | 9 | 0.1 | -288.765 | 923.63 | Critical |
| 2 | 25330 | IN-2013-71249 | 2013-10-17 | 2013-10-18 | First Class | CR-12730 | Craig Reiter | Consumer | Brisbane | Queensland | ... | TEC-PH-10004664 | Technology | Phones | Nokia Smart Phone, with Caller ID | 5175.171 | 9 | 0.1 | 919.971 | 915.49 | Medium |
| 3 | 13524 | ES-2013-1579342 | 2013-01-28 | 2013-01-30 | First Class | KM-16375 | Katherine Murray | Home Office | Berlin | Berlin | ... | TEC-PH-10004583 | Technology | Phones | Motorola Smart Phone, Cordless | 2892.51 | 5 | 0.1 | -96.54 | 910.16 | Medium |
| 4 | 47221 | SG-2013-4320 | 2013-11-05 | 2013-11-06 | Same Day | RH-9495 | Rick Hansen | Consumer | Dakar | Dakar | ... | TEC-SHA-10000501 | Technology | Copiers | Sharp Wireless Fax, High-Speed | 2832.96 | 8 | 0.0 | 311.52 | 903.04 | Critical |


In [4]:
df.columns

Index(['Row ID', 'Order ID', 'Order Date', 'Ship Date', 'Ship Mode',
       'Customer ID', 'Customer Name', 'Segment', 'City', 'State', 'Country',
       'Postal Code', 'Market', 'Region', 'Product ID', 'Category',
       'Sub-Category', 'Product Name', 'Sales', 'Quantity', 'Discount',
       'Profit', 'Shipping Cost', 'Order Priority'],
      dtype='object')

In [5]:
df.shape

(51290, 24)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51290 entries, 0 to 51289
Data columns (total 24 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   Row ID          51290 non-null  int64         
 1   Order ID        51290 non-null  object        
 2   Order Date      51290 non-null  datetime64[ns]
 3   Ship Date       51290 non-null  datetime64[ns]
 4   Ship Mode       51290 non-null  object        
 5   Customer ID     51290 non-null  object        
 6   Customer Name   51290 non-null  object        
 7   Segment         51290 non-null  object        
 8   City            51290 non-null  object        
 9   State           51290 non-null  object        
 10  Country         51290 non-null  object        
 11  Postal Code     9994 non-null   float64       
 12  Market          51290 non-null  object        
 13  Region          51290 non-null  object        
 14  Product ID      51290 non-null  object        
 15  Ca

In [7]:
df.isnull().sum()

Row ID                0
Order ID              0
Order Date            0
Ship Date             0
Ship Mode             0
Customer ID           0
Customer Name         0
Segment               0
City                  0
State                 0
Country               0
Postal Code       41296
Market                0
Region                0
Product ID            0
Category              0
Sub-Category          0
Product Name          0
Sales                 0
Quantity              0
Discount              0
Profit                0
Shipping Cost         0
Order Priority        0
dtype: int64
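
The counts above are easier to judge as shares of the total: 41,296 missing values out of 51,290 rows is roughly 80.5% of the Postal Code column. The sketch below shows one way to compute such shares on a made-up frame; the toy values and the 50% threshold are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 'Postal Code' is mostly missing, 'Sales' is complete.
toy = pd.DataFrame({
    "Postal Code": [10024.0, np.nan, np.nan, np.nan, np.nan],
    "Sales": [2309.65, 3709.40, 5175.17, 2892.51, 2832.96],
})

# isnull().mean() gives the fraction of missing values per column.
missing_share = toy.isnull().mean()

# Columns whose missing share exceeds an assumed 50% threshold.
mostly_missing = missing_share[missing_share > 0.5].index.tolist()

print(missing_share)
print(mostly_missing)
```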

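### **4. Preprocessing the Data**

- **Dropping Irrelevant Columns**: The Postal Code column is dropped because it is missing in 41,296 of the 51,290 rows.
- **Separating Numerical and Categorical Data**: Numerical and categorical columns are split into separate DataFrames for further processing. The two datetime columns (Order Date and Ship Date) match neither filter and are therefore left out of both.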

In [8]:
df = df.drop(columns=['Postal Code'])

In [9]:
df_num = df.select_dtypes(include = [np.number])
df_cat = df.select_dtypes(include = ['object'])
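
The two filters above do not cover every column: datetime columns are neither numeric nor `object`. A minimal sketch on a made-up three-column frame (the values are illustrative, and the string column is created with an explicit `object` dtype so the example behaves the same across pandas versions):

```python
import pandas as pd

# Hypothetical toy frame with one datetime, one text, and one numeric column.
toy = pd.DataFrame({
    "Order Date": pd.to_datetime(["2012-07-31", "2013-02-05"]),
    "Ship Mode": pd.Series(["Same Day", "Second Class"], dtype="object"),
    "Sales": [2309.65, 3709.395],
})

num_cols = toy.select_dtypes(include="number").columns.tolist()
cat_cols = toy.select_dtypes(include="object").columns.tolist()
# Datetime columns need their own selector; they fall through both filters above.
date_cols = toy.select_dtypes(include="datetime").columns.tolist()

print(num_cols, cat_cols, date_cols)
```

If date information is needed later, it would have to be carried along separately or converted into numeric features first.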

In [10]:
df_num.describe()

| | Row ID | Sales | Quantity | Discount | Profit | Shipping Cost |
| --- | --- | --- | --- | --- | --- | --- |
| count | 51290.0 | 51290.0 | 51290.0 | 51290.0 | 51290.0 | 51290.0 |
| mean | 25645.5 | 246.490581 | 3.476545 | 0.142908 | 28.610982 | 26.375818 |
| std | 14806.29199 | 487.565361 | 2.278766 | 0.21228 | 174.340972 | 57.29681 |
| min | 1.0 | 0.444 | 1.0 | 0.0 | -6599.978 | 0.002 |
| 25% | 12823.25 | 30.758625 | 2.0 | 0.0 | 0.0 | 2.61 |
| 50% | 25645.5 | 85.053 | 3.0 | 0.0 | 9.24 | 7.79 |
| 75% | 38467.75 | 251.0532 | 5.0 | 0.2 | 36.81 | 24.45 |
| max | 51290.0 | 22638.48 | 14.0 | 0.85 | 8399.976 | 933.57 |


In [11]:
df_cat.describe(include = 'all')

| | Order ID | Ship Mode | Customer ID | Customer Name | Segment | City | State | Country | Market | Region | Product ID | Category | Sub-Category | Product Name | Order Priority |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 51290 | 51290 | 51290 | 51290 | 51290 | 51290 | 51290 | 51290 | 51290 | 51290 | 51290 | 51290 | 51290 | 51290 | 51290 |
| unique | 25035 | 4 | 1590 | 795 | 3 | 3636 | 1094 | 147 | 7 | 13 | 10292 | 3 | 17 | 3788 | 4 |
| top | CA-2014-100111 | Standard Class | PO-18850 | Muhammed Yedwab | Consumer | New York City | California | United States | APAC | Central | OFF-AR-10003651 | Office Supplies | Binders | Staples | Medium |
| freq | 14 | 30775 | 97 | 108 | 26518 | 915 | 2001 | 9994 | 11002 | 11117 | 35 | 31273 | 6152 | 227 | 29433 |


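### **5. Data Description**

- **Numerical Columns**: `df_num.describe()` above reports the count, mean, standard deviation, and quartiles of each numerical column.
- **Categorical Columns**: `df_cat.describe()` above reports, for each categorical column, the number of unique values, the most frequent value, and how often it occurs.
- The cell below lists the column names in each group.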

In [12]:
print(df_num.columns)
print("--------------------------------")
print(df_cat.columns)

Index(['Row ID', 'Sales', 'Quantity', 'Discount', 'Profit', 'Shipping Cost'], dtype='object')
--------------------------------
Index(['Order ID', 'Ship Mode', 'Customer ID', 'Customer Name', 'Segment',
       'City', 'State', 'Country', 'Market', 'Region', 'Product ID',
       'Category', 'Sub-Category', 'Product Name', 'Order Priority'],
      dtype='object')


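### **6. Encoding Categorical Data & Outlier Removal in Numerical Data**

 * Categorical columns are encoded using one-hot encoding, converting them into numerical features suitable for machine learning. Every categorical column is encoded here, including identifier-like columns such as Order ID, Product ID, and Product Name with thousands of unique values each, which is why the result has 46,428 columns.
 * Outliers are removed from the numerical columns with the IQR method: rows outside the interval [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] are dropped. The function filters one column at a time, so each column's quartiles are computed on the rows that survived the previous columns.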

In [13]:
encoder = OneHotEncoder(sparse_output = False, handle_unknown = 'ignore')
encoded_array = encoder.fit_transform(df_cat)
encoded_df = pd.DataFrame(encoded_array, columns = encoder.get_feature_names_out(df_cat.columns))

In [14]:
encoded_df.shape

(51290, 46428)
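
Most of the 46,428 columns above come from identifier-like fields: according to the unique counts reported earlier, Order ID alone contributes 25,035 and Product ID another 10,292. One possible way to keep the encoding small is to encode only low-cardinality columns. The sketch below illustrates the idea on made-up data and uses `pd.get_dummies` rather than the `OneHotEncoder` from the notebook; the order IDs and the cut-off of 3 levels are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy frame: 'Order ID' is unique per row, the others repeat.
toy_cat = pd.DataFrame({
    "Order ID": ["CA-1", "IN-2", "IN-3", "ES-4"],
    "Ship Mode": ["Same Day", "First Class", "First Class", "Same Day"],
    "Segment": ["Consumer", "Corporate", "Consumer", "Home Office"],
})

max_levels = 3  # assumed cut-off, chosen only for this illustration

# Count distinct values per column and keep the columns at or below the cut-off.
levels = toy_cat.nunique()
low_card_cols = levels[levels <= max_levels].index.tolist()

# One-hot encode only the selected columns.
encoded_small = pd.get_dummies(toy_cat[low_card_cols])

print(low_card_cols)
print(encoded_small.shape)
```

Whether a column such as City or Product Name is worth keeping depends on the modelling task; the cut-off is a judgment call rather than a fixed rule.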

In [15]:
def remove_outliers_iqr(df, columns):
    for col in columns:
        Q1 = df[col].quantile(0.25)  # First quartile (25th percentile)
        Q3 = df[col].quantile(0.75)  # Third quartile (75th percentile)
        IQR = Q3 - Q1               # Interquartile range
        lower_bound = Q1 - 1.5 * IQR  # Lower bound
        upper_bound = Q3 + 1.5 * IQR  # Upper bound

        # Remove outliers
        df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]
    
    return df

In [16]:
df_num.columns

Index(['Row ID', 'Sales', 'Quantity', 'Discount', 'Profit', 'Shipping Cost'], dtype='object')

In [17]:
columns_to_check = ['Row ID','Sales', 'Quantity', 'Discount', 'Profit',
       'Shipping Cost']
df_no_outliers = remove_outliers_iqr(df_num,  columns_to_check)

print("Original DataFrame")
print(df_num.shape)
print("DataFrame after Outlier Treatment")
print(df_no_outliers.shape)

Original DataFrame
(51290, 6)
DataFrame after Outlier Treatment
(30879, 6)
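
The function above filters column by column, so the quartiles of later columns are computed on rows that already passed earlier filters. A variant that computes all fences on the original data and applies them in one step is sketched below. The helper name `iqr_mask` and the toy values are hypothetical, and applied to the real data this variant would not necessarily reproduce the 30,879 rows reported above.

```python
import pandas as pd


def iqr_mask(df, columns, k=1.5):
    """Return a boolean mask that is True for rows inside the IQR fences of every column."""
    q1 = df[columns].quantile(0.25)
    q3 = df[columns].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Compare each column against its own fences, then require all columns to pass.
    return ((df[columns] >= lower) & (df[columns] <= upper)).all(axis=1)


# Hypothetical toy data with one extreme 'Sales' value in the last row.
toy = pd.DataFrame({
    "Sales": [10.0, 12.0, 11.0, 13.0, 12.0, 500.0],
    "Quantity": [1, 2, 3, 4, 5, 3],
})

kept = toy[iqr_mask(toy, ["Sales", "Quantity"])]
print(kept)
```

Because the mask keeps the original index, the surviving rows can later be matched against other frames built from the same data.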


### **7. Scaling the Data**

- **MinMaxScaler**: Scales features to a range between 0 and 1, ensuring no feature dominates due to its scale.
- **StandardScaler**: Standardizes features by removing the mean and scaling to unit variance.

**1. MinMax Scaler**

In [18]:
columns_to_scale = ['Row ID','Sales', 'Quantity', 'Discount', 'Profit',
       'Shipping Cost']

minmax_scaler = MinMaxScaler()
df_minmax_scaled = df_no_outliers.copy()
df_minmax_scaled[columns_to_scale] = minmax_scaler.fit_transform(df_no_outliers[columns_to_scale])

In [19]:
df_minmax_scaled.head()

| | Row ID | Sales | Quantity | Discount | Profit | Shipping Cost |
| --- | --- | --- | --- | --- | --- | --- |
| 12068 | 0.649399 | 0.276564 | 0.166667 | 0.4 | 0.696868 | 1.0 |
| 12069 | 0.708417 | 0.345885 | 0.5 | 0.6 | 0.0104 | 1.0 |
| 12073 | 0.208661 | 0.463967 | 0.5 | 0.0 | 0.433239 | 0.999243 |
| 12074 | 0.59077 | 0.315389 | 0.5 | 0.8 | 0.443606 | 0.999243 |
| 12078 | 0.612451 | 0.225215 | 0.833333 | 0.4 | 0.772315 | 0.998486 |


**2. Standard Scaler**

In [20]:
standard_scaler = StandardScaler()
df_standard_scaled = df_no_outliers.copy()
df_standard_scaled[columns_to_scale] = standard_scaler.fit_transform(df_no_outliers[columns_to_scale])

In [21]:
df_standard_scaled.head()

| | Row ID | Sales | Quantity | Discount | Profit | Shipping Cost |
| --- | --- | --- | --- | --- | --- | --- |
| 12068 | 0.55465 | 1.208948 | -0.530879 | 0.734589 | 1.419094 | 3.27575 |
| 12069 | 0.759078 | 1.760251 | 0.718684 | 1.406908 | -2.778887 | 3.27575 |
| 12073 | -0.971982 | 2.699341 | 0.718684 | -0.61005 | -0.193081 | 3.272457 |
| 12074 | 0.351571 | 1.517718 | 0.718684 | 2.079228 | -0.129688 | 3.272457 |
| 12078 | 0.426671 | 0.800576 | 1.968248 | 0.734589 | 1.880483 | 3.269164 |


In [22]:
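# Caution: df_standard_scaled holds only the 30,879 rows that survived outlier
# removal, while encoded_df still has all 51,290 rows. pd.concat aligns on the
# index with an outer join by default, so the removed rows reappear with NaN in
# the six numerical columns. Passing join="inner" would keep only shared rows.
final_df_1 = pd.concat([df_standard_scaled, encoded_df], axis = 1)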

In [23]:
final_df_1.shape

(51290, 46434)
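
The row count of 51,290 above is larger than the 30,879 rows left after outlier removal, because `pd.concat` aligns the two frames on their index and keeps the union of labels by default. The sketch below reproduces the effect on two made-up frames and shows how `join="inner"` restricts the result to rows present in both; all names and values here are illustrative assumptions.

```python
import pandas as pd

# Hypothetical frames: 'scaled' kept only rows 1 and 3, 'encoded' has all four rows.
scaled = pd.DataFrame({"Sales": [0.1, 0.9]}, index=[1, 3])
encoded = pd.DataFrame({"Segment_Consumer": [1.0, 0.0, 1.0, 0.0]}, index=[0, 1, 2, 3])

# Default outer join: rows 0 and 2 come back with NaN in 'Sales'.
outer = pd.concat([scaled, encoded], axis=1)

# Inner join: only the rows present in both frames are kept.
inner = pd.concat([scaled, encoded], axis=1, join="inner")

print(outer)
print(inner)
```

Many scikit-learn estimators reject NaN inputs, so checking the combined frame for missing values before modelling is a sensible precaution.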

In [24]:
final_df_1.columns

Index(['Row ID', 'Sales', 'Quantity', 'Discount', 'Profit', 'Shipping Cost',
       'Order ID_AE-2011-9160', 'Order ID_AE-2013-1130',
       'Order ID_AE-2013-1530', 'Order ID_AE-2014-2840',
       ...
       'Product Name_iHome FM Clock Radio with Lightning Dock',
       'Product Name_iKross Bluetooth Portable Keyboard + Cell Phone Stand Holder + Brush for Apple iPhone 5S 5C 5, 4S 4',
       'Product Name_iOttie HLCRIO102 Car Mount',
       'Product Name_iOttie XL Car Mount',
       'Product Name_invisibleSHIELD by ZAGG Smudge-Free Screen Protector',
       'Product Name_netTALK DUO VoIP Telephone Service',
       'Order Priority_Critical', 'Order Priority_High', 'Order Priority_Low',
       'Order Priority_Medium'],
      dtype='object', length=46434)

### **Conclusion**

This notebook demonstrated the steps to preprocess data, encode categorical features, and scale numerical features using MinMaxScaler and StandardScaler. These steps are crucial for preparing data for machine learning models.