<a href="https://colab.research.google.com/github/Soichiro-Gardinner/Prediction-of-Grocery-Sales/blob/main/Sales_Prediction_part_5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sales Prediction [pt 5]
- **By:** Oscar Castanaza

In [2]:
# Importing libraries
import pandas as pd

In [3]:
# Loading the data into a DataFrame (df):
df = pd.read_csv("/content/Sales Pred [Week(2)].csv")

# **Fix inconsistencies in categorical data**

Before splitting the data, we can check for duplicates and fix inconsistencies in the categorical data.

**For example**, we can check the unique values of the 'Item_Fat_Content' column to see whether any category is spelled in more than one way:

In [4]:
df['Item_Fat_Content'].unique()

array(['Low Fat', 'Regular', 'low fat', 'LF', 'reg'], dtype=object)

**_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _**

**We can see that the same two categories ('Low Fat' and 'Regular') appear under five different spellings.**

We can fix this by replacing 'LF' and 'low fat' with 'Low Fat', and 'reg' with 'Regular':

In [5]:
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace({'LF': 'Low Fat', 'low fat': 'Low Fat', 'reg': 'Regular'})
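
To confirm that the replacement did what we expect, we can re-check the unique values afterwards. The sketch below uses a small toy Series (an assumption made for illustration, not the real dataset) that contains the same five spellings seen in the output above:

```python
import pandas as pd

# Toy column (illustrative only) with the five spellings found in the real data.
fat = pd.Series(['Low Fat', 'Regular', 'low fat', 'LF', 'reg'], name='Item_Fat_Content')

# Same mapping as in the cell above.
mapping = {'LF': 'Low Fat', 'low fat': 'Low Fat', 'reg': 'Regular'}
fat_clean = fat.replace(mapping)

# After the fix, only the two canonical labels should remain.
remaining = sorted(fat_clean.unique())
counts = fat_clean.value_counts()
print(remaining)
print(counts)
```

On the real data, the equivalent check is to run `df['Item_Fat_Content'].unique()` again after the replacement.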

**_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _**

We can also **drop duplicates** using the drop_duplicates() method:

In [6]:
df = df.drop_duplicates()
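
Before dropping anything, it can help to count how many duplicated rows there are. A minimal sketch on a toy DataFrame (the values are made up for illustration):

```python
import pandas as pd

# Toy data (illustrative only): the second row is an exact copy of the first.
toy = pd.DataFrame({
    'Item_Identifier': ['A1', 'A1', 'B2'],
    'Item_MRP': [10.0, 10.0, 25.5],
})

# duplicated() flags every row that repeats an earlier row.
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_dupes, len(deduped))
```

On the real data, `df.duplicated().sum()` gives the same count before `drop_duplicates()` is applied.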

**_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _**

# **Features_[X] & Target_[y]**

To identify the **features (X)** and **target (y)**, we can assign the **'Item_Outlet_Sales'** column as the target and the rest of the relevant variables as the features matrix:

In [9]:
# Features:
X = df.drop(['Item_Outlet_Sales'], axis=1)
# Target:
y = df['Item_Outlet_Sales']

**_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _**

We can perform a **train-test-split** using the **train_test_split()** function from Scikit-Learn:

In [10]:
# Importing train_test_split:
from sklearn.model_selection import train_test_split

# Splitting the data (70% train, 30% test; random_state fixes the shuffle so the split is reproducible):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
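
A quick way to sanity-check a split is to look at the sizes of the resulting pieces. The sketch below uses a toy dataset of 10 rows (an assumption for illustration) with the same `test_size` and `random_state` as above:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy features and target (illustrative only).
toy_X = pd.DataFrame({'feature': range(10)})
toy_y = pd.Series(range(10), name='target')

X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, test_size=0.3, random_state=42)

# Features and target are split on the same rows, and no row lands in both sets.
print(X_tr.shape, X_te.shape)
print(set(X_tr.index) & set(X_te.index))
```

On the real data, `X_train.shape` and `X_test.shape` give the same kind of check.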

**_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _**

To create a preprocessing object to prepare the dataset for machine learning, we can use Scikit-Learn's ColumnTransformer and Pipeline classes.

In [11]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Defining numerical and categorical features
num_features = ['Item_Weight', 'Item_Visibility', 'Item_MRP', 'Outlet_Establishment_Year']
cat_features = ['Item_Identifier', 'Item_Fat_Content', 'Item_Type', 'Outlet_Identifier', 'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type']

# Creating pipelines for numerical and categorical features
num_pipeline = Pipeline([('imputer', SimpleImputer(strategy='mean')), ('scaler', StandardScaler())])
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; on older versions use sparse=False instead
cat_pipeline = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')), ('onehot', OneHotEncoder(sparse_output=False, handle_unknown='ignore'))])

# Using ColumnTransformer to combine the pipelines for numerical and categorical features
preprocessor = ColumnTransformer(transformers=[('num', num_pipeline, num_features), ('cat', cat_pipeline, cat_features)])

# Note:
- We create separate pipelines for numerical and categorical features.
- For numerical features, we impute missing values with the column mean and then scale the features using StandardScaler.
- For categorical features, we impute missing values with the most frequent value and then one-hot encode the features using OneHotEncoder. With handle_unknown='ignore', a category that was not seen during fitting is encoded as all zeros instead of raising an error.



**_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _**

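We can then **fit** the preprocessor on the training data **and transform** both the training and test data using the fitted preprocessor. Fitting on the training data only keeps the test set out of the imputation values and scaling statistics: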

In [12]:
# X Train:
X_train_preprocessed = preprocessor.fit_transform(X_train)
# X Test:
X_test_preprocessed = preprocessor.transform(X_test)
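
To see what "fit on train, transform on test" does in practice, here is a minimal sketch on toy data. The column names `weight` and `size`, the helper `to_dense`, and the values are all assumptions made for illustration. The test row has a missing number and a category that never appears in training:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Toy data (illustrative only).
train = pd.DataFrame({'weight': [1.0, 3.0, np.nan], 'size': ['Small', 'High', 'Small']})
test = pd.DataFrame({'weight': [np.nan], 'size': ['Medium']})  # 'Medium' is unseen in training

num_pipe = Pipeline([('imputer', SimpleImputer(strategy='mean')), ('scaler', StandardScaler())])
cat_pipe = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                     ('onehot', OneHotEncoder(handle_unknown='ignore'))])
prep = ColumnTransformer([('num', num_pipe, ['weight']), ('cat', cat_pipe, ['size'])])

def to_dense(a):
    # Hypothetical helper: the encoder may return a sparse matrix, so convert if needed.
    return a.toarray() if hasattr(a, 'toarray') else np.asarray(a)

train_out = to_dense(prep.fit_transform(train))  # statistics are learned here only
test_out = to_dense(prep.transform(test))        # reuses the training mean and categories

print(train_out)
print(test_out)
```

The missing test weight is filled with the training mean, which becomes 0 after scaling, and the unseen category 'Medium' is encoded as zeros in every one-hot column.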



In [14]:
# fit_transform returns a NumPy array:
print(X_train_preprocessed)

[[-1.23795688  1.6066808  -0.40189546 ...  0.          0.
   0.        ]
 [ 1.61657069 -1.00844167 -0.61928426 ...  1.          0.
   0.        ]
 [ 0.         -0.22706823 -0.20253536 ...  0.          0.
   1.        ]
 ...
 [ 1.1064099  -0.91757329  1.5257291  ...  1.          0.
   0.        ]
 [ 1.75894114 -0.22428724 -0.3811361  ...  1.          0.
   0.        ]
 [ 0.80980479 -0.95575131 -0.73573148 ...  1.          0.
   0.        ]]


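The output is an array with the scaled numerical columns first, followed by the one-hot encoded columns, which means the preprocessing ran successfully. A value of exactly 0 in a numerical column, as in the third row, is consistent with a missing value that was imputed with the mean before scaling.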