# **Project Name**    - Retail Sales Prediction (Regression)


---



---




##### **Project Type**    - Regression
##### **Contribution**    - Individual


# **Project Summary -**

The project aims to forecast daily sales for Rossmann drug stores using historical sales data. Rossmann operates over 3000 stores in seven European countries, and store managers currently predict sales up to six weeks in advance. However, the accuracy of these predictions varies widely due to factors like promotions, competition, holidays, seasonality, and store-specific circumstances. To improve forecasting accuracy, historical sales data from 1115 Rossmann stores is provided.

The problem is defined as developing a data science solution that accurately predicts the "Sales" column for the test set. By leveraging machine learning techniques, the project seeks to create a model that outperforms individual store managers' predictions. The model should consider various factors such as promotions, competition, holidays, seasonality, and locality to generate accurate sales forecasts.

To implement the project, several steps will be followed. Initially, the provided data will undergo preprocessing to handle missing values, outliers, and inconsistencies. Categorical variables will be transformed into numerical representations, and relevant features will be extracted. Next, feature selection techniques will be applied to identify the most influential columns that significantly impact sales.

Exploratory Data Analysis (EDA) will provide insights into relationships between variables, uncovering patterns, trends, and seasonalities. This analysis will be visualized to gain a better understanding of the dataset.

For model development, the dataset will be split into training and testing sets. Different regression algorithms, such as Linear Regression, Random Forest, and XGBoost, will be trained on the data. Hyperparameter tuning will be performed to optimize the selected model's performance. Model evaluation will utilize appropriate metrics like Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) to assess the accuracy of the sales forecasts.

The trained model will be applied to the test set to predict sales for each store. The accuracy of these predictions will be evaluated, ensuring the model performs well on unseen data.

The dataset's columns play crucial roles in forecasting sales. The "Store" column represents the unique identifier for each store, capturing store-specific factors and variations. "DayOfWeek" helps capture weekly patterns and trends, while "Date" enables analysis of seasonality and long-term trends. The "Sales" column serves as the target variable for forecasting. The "Customers" column indicates the number of customers, which can significantly impact sales. "Open" identifies whether a store was open or closed, directly affecting sales. "Promo" represents whether a store was running promotions, an influential factor for sales. "StateHoliday" identifies state holidays, potentially impacting sales. Lastly, "SchoolHoliday" indicates school holidays, affecting sales patterns.

By leveraging the provided dataset and applying advanced data science techniques, this project aims to improve the accuracy of sales forecasts for Rossmann drug stores. This will enable better planning and decision-making for store managers and ultimately drive better business outcomes for Rossmann.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**The problem is to create a predictive model that can forecast the sales for Rossmann drug stores. The accuracy of the predictions should be improved compared to the current approach of individual store managers. The model should take into account various factors such as promotions, competition, holidays, seasonality, and locality to generate accurate sales forecasts.**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart follows the format below.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure each algorithm answers the following points.


*   Explain the ML model used and its performance using an evaluation metric score chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with an updated evaluation metric score chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
import numpy as np # NumPy is used for scientific computing and provides functions for efficient array operations, linear algebra, and mathematical calculations.
import pandas as pd # Pandas is used for data manipulation and analysis. It provides data structures and functions to work with structured data, such as data frames.
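from scipy.stats import * # The scipy.stats module provides a wide range of statistical functions and distributions for statistical analysis and hypothesis testing.
from scipy.stats.mstats import winsorize # winsorize clips extreme values to chosen percentile limits; it is imported explicitly from scipy.stats.mstats for the outlier step below.
import math # Python's built-in math module provides basic mathematical functions. (numpy.math was only an alias for it and has been removed from recent NumPy releases.)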
from numpy import loadtxt # The loadtxt function from NumPy is used to load data from a text file into an array or variables.

from sklearn.preprocessing import MinMaxScaler # MinMaxScaler is used for scaling numerical features to a specific range, typically between 0 and 1, to ensure that all features have a similar scale.
from sklearn.model_selection import train_test_split # train_test_split is used to split the dataset into training and testing sets for model evaluation and validation.
from sklearn.linear_model import LinearRegression # LinearRegression is used to perform linear regression analysis and build linear regression models.
from sklearn.metrics import r2_score # r2_score is used to calculate the coefficient of determination (R-squared) to evaluate the performance of regression models.
from sklearn.metrics import mean_squared_error # mean_squared_error is used to calculate the mean squared error (MSE) to measure the performance of regression models.


import matplotlib.pyplot as plt # Matplotlib is a plotting library used to create visualizations and graphs. The %matplotlib inline command is used in Jupyter Notebook to display plots inline.
%matplotlib inline

import seaborn as sns # Seaborn is a data visualization library based on Matplotlib. It provides a high-level interface for creating informative and visually appealing statistical graphics.

from sklearn.linear_model import Ridge, RidgeCV # Ridge is L2-regularized linear regression used to reduce overfitting; RidgeCV adds built-in cross-validation for choosing alpha.
from sklearn.linear_model import Lasso, LassoCV # Lasso is L1-regularized linear regression used to reduce overfitting; LassoCV adds built-in cross-validation for choosing alpha.
from sklearn.preprocessing import StandardScaler # StandardScaler is used to standardize numerical features by removing the mean and scaling to unit variance.
from imblearn.over_sampling import SMOTE # SMOTE (Synthetic Minority Over-sampling Technique) is used for oversampling the minority class in imbalanced datasets to address class imbalance issues.
from sklearn.linear_model import LogisticRegression # LogisticRegression is used for logistic regression analysis and building logistic regression models for classification tasks.
from sklearn.ensemble import RandomForestClassifier # RandomForestClassifier is an ensemble learning method that combines multiple decision trees to build a classification model.
from sklearn.metrics import accuracy_score, confusion_matrix # accuracy_score is used to calculate the accuracy of classification models. confusion_matrix is used to compute the confusion matrix to evaluate classification model performance.
from sklearn import metrics # metrics provides various metrics for model evaluation.
from sklearn.metrics import roc_curve # roc_curve is used to plot the receiver operating characteristic (ROC) curve for binary classification models.
from sklearn.model_selection import GridSearchCV # GridSearchCV is used for hyperparameter tuning by exhaustively searching the specified parameter values. 
from sklearn.model_selection import RepeatedStratifiedKFold # RepeatedStratifiedKFold is a cross-validation strategy that ensures stratification and repeated sampling of data during model evaluation.
from xgboost import XGBClassifier # XGBClassifier is an implementation of the XGBoost algorithm for classification tasks.
from xgboost import XGBRFClassifier # XGBRFClassifier is an implementation of the XGBoost algorithm for random forest-based classification tasks.
from sklearn.tree import export_graphviz # export_graphviz is used to export decision tree models in Graphviz format for visualization.

import warnings
warnings.filterwarnings('ignore') # The warnings module is used to manage warning messages. The filterwarnings function is used to ignore warnings during code execution.


### Dataset Loading

In [None]:
data = "/content/Rossmann_Stores_Data.csv"

In [None]:
# Loading the dataset
df = pd.read_csv(data, encoding = "ISO-8859-1") # The encoding argument tells pandas how to decode the bytes in the CSV file.
# Passing encoding = "ISO-8859-1" avoids decoding errors when the file was not saved as UTF-8.

### Dataset First View

In [None]:
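# Dataset First Look: show the first and last 30 rows.
# A notebook cell renders only its last expression, so display() is used for both tables.
display(df.head(30))
display(df.tail(30))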

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

# check for duplicates
if df.duplicated().any():
    print("There are duplicates in the dataset.")
else:
    print("There are no duplicates in the dataset.")

# Check for duplicates count
duplicates = df.duplicated()
print('\nDuplicates:\n', duplicates.sum())


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

print(df.isnull().sum())

In [None]:
# Visualizing the missing values

# Create a heatmap of missing/null values in the DataFrame
sns.heatmap(df.isnull(), cmap='coolwarm')

# Show the plot
plt.show()

# create a bar chart of the null values in the dataframe
df.isnull().sum().plot(kind='bar')

# Show the plot
plt.show()

### What did you know about your dataset?

**Generally!!!**

This dataset contains historical daily sales records for Rossmann drug stores. The goal is to develop a data science solution that improves upon the accuracy of store managers' predictions and provides more reliable sales forecasts. By leveraging machine learning techniques and considering factors like promotions, competition, holidays, seasonality, and locality, the project aims to generate accurate sales forecasts.


**Technically!!!**

The dataset has 1017209 rows and 9 columns.


The data has no duplicate values



The data has no NULL values.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print(df.columns) # This prints all the columns present in the dataset

In [None]:
# Dataset Describe
df.describe(include='all') # include='all' summarizes every column: mean, standard deviation, minimum, maximum, and quartiles for numeric columns, and count, unique, top, and freq for non-numeric ones.

### Variables Description 

The dataset used in this project contains historical sales data for 1115 Rossmann drug stores. 

It includes the following columns:


**Store:**  Unique identifier for each Rossmann store.

**DayOfWeek:** Numeric representation (1-7) of the day of the week (Monday-Sunday).

**Date:** The specific date of the entry.

**Sales:** The total sales for a particular store on a given day (target variable).

**Customers:** The number of customers visiting a store on a particular day.

**Open:** Binary indicator (1 or 0) representing whether the store was open or closed on a particular day.

**Promo:** Binary indicator (1 or 0) representing whether a store was running a promotion on a particular day.

**StateHoliday:** Categorical variable indicating whether a particular day is a state holiday (a, b, c) or not (0).

**SchoolHoliday:** Binary indicator (1 or 0) representing whether a particular day is a school holiday or not.


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in df.columns.tolist():
  print("No. of unique values in ",i,"is",df[i].nunique(),".")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Copying the dataset to a new variable "df1"

df1 = df.copy(deep = True)

#### 1. Handling Missing Values.



In [None]:
print(df1.isnull().sum())

##### As there are no missing values in this dataset, we can move further.

#### 2. Outlier Detection and Treatment: 

##### Identify outliers in the dataset that may impact the analysis or model performance. Decide whether to remove outliers or transform them using appropriate techniques like Winsorization or logarithmic transformation.

In [None]:
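# Finding min and max of the numeric columns (including Sales) before outlier treatment.
# numeric_only=True skips text columns such as Date and StateHoliday, which may not be comparable.
print("MAX Values Before Outlier Treatment")
print(df1.max(numeric_only=True))
print("____________________________________________________")
print("MIN Values Before Outlier Treatment")
print(df1.min(numeric_only=True))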

In [None]:
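# Outlier detection and treatment for the Sales column
from scipy.stats.mstats import winsorize  # winsorize is provided by scipy.stats.mstats

def handle_outliers(df1):
    # Apply Winsorization: the lowest 5% and highest 5% of Sales values
    # are replaced by the nearest remaining value on each side.
    # winsorize returns a masked array, so convert it back to a plain array.
    df1['Sales'] = np.asarray(winsorize(df1['Sales'].to_numpy(), limits=[0.05, 0.05]))

    return df1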
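
The effect of Winsorization is easier to see on a small example. The sketch below uses made-up sales figures and a hypothetical helper name (`winsorize_sales`), with 10% limits chosen only so that exactly one value is clipped on each side; it is an illustration, not the treatment applied to the real data.

```python
import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize

# Toy sales values (made up for illustration) with one extreme value at the end.
toy = pd.DataFrame({"Sales": [100, 110, 120, 130, 140, 150, 160, 170, 180, 5000]})

def winsorize_sales(frame, lower=0.1, upper=0.1):
    """Return a copy of `frame` with the Sales tails clipped (hypothetical helper)."""
    out = frame.copy()
    # winsorize returns a masked array; np.asarray turns it into a plain ndarray.
    out["Sales"] = np.asarray(
        winsorize(out["Sales"].to_numpy(), limits=[lower, upper])
    )
    return out

clipped = winsorize_sales(toy)
print("max before:", toy["Sales"].max())
print("max after: ", clipped["Sales"].max())
```

Returning a copy keeps the raw values available for comparison, which makes it simple to confirm that the treatment actually changed something.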

In [None]:
print("MAX Values After Outlier Treatment")
print(df1.max(numeric_only=True))
print("____________________________________________________")
print("MIN Values After Outlier Treatment")
print(df1.min(numeric_only=True))

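##### `handle_outliers` is only defined above and is never called, so the values printed here are the same as before treatment. Call `df1 = handle_outliers(df1)` first to see the effect of Winsorization; identical values on their own do not show that the dataset has no outliers.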

#### 3. Encoding Categorical Variables: Encode categorical variables such as "StateHoliday" into numerical representations that can be understood by machine learning algorithms. 

##### Manual label encoding can be used depending on the nature of the variable and the algorithm being used.

In [None]:
print(df1.StateHoliday)

In [None]:
# Helper function to fetch the unique values present in a column.

def get_unique_values(df1, column_name):
    # Returns an array of the unique values in the specified column, in order of first appearance.
    unique_values = df1[column_name].unique()
    return unique_values

In [None]:
# Calling the function with the name of the column whose unique values we want to fetch.

unique_names = get_unique_values(df1, 'StateHoliday')

# Print the unique values
print(unique_names)

In [None]:
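# Manual label encoding of StateHoliday.
# astype(str) guards against the column holding both the integer 0 and the string "0";
# without it, any value that is not a key of the mapping would become NaN.
mapping = {"0": 5, "a": 4, "b": 3, "c": 2}
df1["StateHoliday"] = df1["StateHoliday"].astype(str).map(mapping)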


In [None]:

# Calling the function with the names of the columns whose unique values we want to fetch.

unique_names = get_unique_values(df1, 'Open')
unique_names1 = get_unique_values(df1, 'StateHoliday')
# Print the unique values
print(unique_names)
print(unique_names1)
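
An integer mapping imposes an order on the holiday types. As an alternative, the sketch below one-hot encodes a toy "StateHoliday" column with `pd.get_dummies`. The toy values and the `Sales` numbers are made up, and the mix of the integer `0` with the string `"0"` is an assumption about how the raw column may be parsed.

```python
import pandas as pd

# Toy frame (made up); the last row uses the integer 0 instead of the string "0".
toy = pd.DataFrame({
    "Sales": [5000, 0, 0, 0, 6200],
    "StateHoliday": ["0", "a", "b", "c", 0],
})

# Normalize to strings first so 0 and "0" land in the same category.
holiday = toy["StateHoliday"].astype(str)

# One indicator column per category, stored as 0/1 integers.
dummies = pd.get_dummies(holiday, prefix="StateHoliday", dtype=int)

# Replace the original column with the indicator columns.
encoded = pd.concat([toy.drop(columns="StateHoliday"), dummies], axis=1)
print(encoded)
```

With indicator columns, a model no longer treats one holiday type as numerically larger than another; the trade-off is three or four extra columns.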

#### Manual Label encoding is done successfully

#### 4. Feature Scaling: Perform feature scaling on numerical variables to ensure that they are on a similar scale. Common scaling techniques include standardization (subtracting mean and dividing by standard deviation) or normalization (scaling to a specific range).



In [None]:
def perform_feature_scaling(df1):
    # Scale numerical variables using StandardScaler
    scaler = StandardScaler()
    df1['Sales'] = scaler.fit_transform(df1['Sales'].values.reshape(-1, 1))
    
    return df1
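
When scaling feeds a model, the scaler is usually fitted on training rows only and then reused on held-out rows, so that no information from the test period leaks into training. The sketch below illustrates this with made-up customer counts; the variable names are hypothetical and it scales an input feature rather than the "Sales" target used above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up customer counts, shaped (n_samples, 1) as scikit-learn expects.
train_customers = np.array([[500.0], [600.0], [700.0]])
test_customers = np.array([[800.0]])

scaler = StandardScaler()

# Learn mean and standard deviation from the training rows only.
train_scaled = scaler.fit_transform(train_customers)

# Reuse the training statistics on the held-out rows.
test_scaled = scaler.transform(test_customers)

print(train_scaled.ravel())
print(test_scaled.ravel())
```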

#### 5. Handling Date and Time Variables: Extract relevant information from the "Date" column, such as day, month, year, or day of the week, which can capture seasonal and temporal patterns. Additional features like lagged variables (previous day's sales, etc.) can also be created.

In [None]:
def handle_date_variables(df1):
    # Parse the Date column once and reuse the result
    dates = pd.to_datetime(df1['Date'])

    # Extract day, month, and year from Date column
    df1['Day'] = dates.dt.day
    df1['Month'] = dates.dt.month
    df1['Year'] = dates.dt.year

    # Create a column for day of the week: 1 (Monday) to 7 (Sunday)
    df1['DayOfWeek'] = dates.dt.dayofweek + 1

    return df1

print(df1['DayOfWeek'])
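
The lagged variables mentioned in this step can be built with a per-store `shift`. The sketch below is a minimal illustration on made-up rows; `add_lag_features` and the `Sales_lag1` column name are hypothetical and do not appear elsewhere in this notebook.

```python
import pandas as pd

# Made-up rows for two stores over three consecutive days.
toy = pd.DataFrame({
    "Store": [1, 1, 1, 2, 2, 2],
    "Date": pd.to_datetime(["2015-01-01", "2015-01-02", "2015-01-03"] * 2),
    "Sales": [100, 120, 90, 200, 210, 190],
})

def add_lag_features(frame, lags=(1,)):
    """Add Sales_lag<k> columns holding each store's sales k rows earlier."""
    # Sort so that shifting moves backwards in time within each store.
    out = frame.sort_values(["Store", "Date"]).copy()
    for lag in lags:
        # groupby keeps one store's history from leaking into another's.
        out[f"Sales_lag{lag}"] = out.groupby("Store")["Sales"].shift(lag)
    return out

lagged = add_lag_features(toy)
print(lagged)
```

The first row of each store has no previous day, so its lag is missing and needs to be dropped or imputed before modeling.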

#### 6. Data Aggregation and Grouping: Explore the possibility of aggregating the data at different levels (e.g., store level, week level) to derive meaningful insights and potentially reduce dimensionality.

In [None]:
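# Step 6: Data Aggregation and Grouping (Example: Weekly Sales)
def aggregate_weekly_sales(df1):
    # Work on a copy so the caller's DataFrame keeps its original Date column
    weekly = df1.copy()
    weekly['Date'] = pd.to_datetime(weekly['Date'])

    # Sum only the additive columns; summing Store IDs or binary flags is not meaningful.
    # 'W-MON' groups rows into weeks that end on Monday.
    weekly = weekly.resample('W-MON', on='Date')[['Sales', 'Customers']].sum().reset_index()

    return weekly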
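
Aggregating at store level as well as week level keeps store-specific patterns visible. The sketch below, on made-up rows, combines `groupby` with `pd.Grouper` to get weekly totals per store; it is one possible approach rather than the aggregation used in this notebook.

```python
import pandas as pd

# Made-up daily rows for two stores.
toy = pd.DataFrame({
    "Store": [1, 1, 2, 2],
    "Date": pd.to_datetime(["2015-01-05", "2015-01-06", "2015-01-05", "2015-01-13"]),
    "Sales": [100, 150, 200, 50],
})

# Group by store and by week (weeks ending on Monday), then total the sales.
weekly = (
    toy.groupby(["Store", pd.Grouper(key="Date", freq="W-MON")])["Sales"]
    .sum()
    .reset_index()
)
print(weekly)
```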

#### 7. Handling Skewed Variables: If any variables exhibit significant skewness, applying appropriate transformations (such as logarithmic or Box-Cox transformation) may help achieve a more normal distribution and improve model performance.

In [None]:
def perform_log_transformation(df1):
    # Apply logarithmic transformation to Sales column
    df1['Sales'] = np.log1p(df1['Sales'])
    
    return df1
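
If a model is trained on log-transformed sales, its predictions have to be mapped back to the original scale before they are reported or scored. The sketch below shows the round trip with `np.log1p` and its inverse `np.expm1` on made-up values.

```python
import numpy as np

# Made-up sales values, including a zero (e.g. a closed store).
sales = np.array([0.0, 10.0, 1000.0, 25000.0])

# log1p computes log(1 + x), so zero sales map to 0 instead of -inf.
log_sales = np.log1p(sales)

# expm1 computes exp(x) - 1 and undoes the transformation.
restored = np.expm1(log_sales)

print(log_sales)
print(restored)
```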

### What all manipulations have you done and insights you found?

**1: Handling Missing Values**

In this step, the dataset was checked for missing values with `isnull().sum()`. No column contains missing values, so no imputation was required.

**Insights:** Because the dataset is complete, every row can be used for analysis and modeling without dropping records or filling in estimated values.


**2: Outlier Detection and Treatment**

Outliers can significantly impact the analysis and modeling process. In this step, a helper was written to treat outliers in the "Sales" column with Winsorization, which replaces the most extreme values (here the lowest and highest 5%) with the nearest value inside those limits.

**Insights:** By handling outliers, we can mitigate their influence on statistical measures and model performance. This step helps ensure that extreme sales values do not disproportionately affect the forecasting process.

**3: Encoding Categorical Variables**

Categorical variables like "StateHoliday" need to be converted into numerical representations for machine learning algorithms to process. The "StateHoliday" column was encoded with a manual mapping from its categories ("0", "a", "b", "c") to the integers 5, 4, 3, and 2.

**Insights:** Encoding categorical variables allows us to incorporate them into our models. An integer mapping imposes an order on the holiday types, and linear models treat that order as meaningful, so one-hot encoding is worth considering when no such order is intended.

**4: Feature Scaling**

A helper was defined to scale the "Sales" column using the StandardScaler from scikit-learn. Standardization transforms the data to have zero mean and unit variance.

**Insights:** Scaling numerical variables is crucial to ensure that all features contribute equally to the model. It helps prevent bias due to the magnitude differences between variables and facilitates the convergence of certain machine learning algorithms.

**5: Handling Date and Time Variables**

Date and time variables, such as "Date," can provide valuable insights into seasonality and temporal patterns. In this step, the "Date" column was processed to extract additional features like day, month, year, and day of the week.

**Insights:** By extracting specific components from the date, we can capture trends related to different time periods. Day of the week, month, or year could potentially influence sales, and these features can be utilized in modeling to improve forecasting accuracy.

**6: Data Aggregation and Grouping**

In this step, a helper was defined to aggregate the dataset on a weekly basis using the resample function. Aggregating the data at a higher level, such as weekly sales, helps to analyze long-term trends and reduces the number of rows in the dataset.

**Insights:** Aggregating the data can reveal higher-level patterns and provide a more holistic view of sales trends. Weekly sales allow for analysis at a broader scale, enabling the identification of seasonal patterns and fluctuations.

**7: Handling Skewed Variables (Log Transformation)**

In this step, a logarithmic transformation (log1p) was applied to the "Sales" column. Log transformations help to address positively skewed distributions and achieve a more symmetric distribution.

**Insights:** Skewed variables can introduce bias and violate assumptions of certain statistical models. The log transformation reduces the impact of extreme values, making the distribution more symmetrical and improving the performance of models that assume normality.

**These manipulations contribute to preparing the data for analysis and modeling, ensuring that it is in a suitable format and quality for accurate sales forecasting. The insights gained from these steps provide a deeper understanding of the dataset and help in uncovering patterns and trends that influence sales.**

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 2

In [None]:
# Chart - 2 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3

In [None]:
# Chart - 3 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
# Chart - 4 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot 

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
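
As one possible shape for such a test, the sketch below runs Welch's two-sample t-test (`scipy.stats.ttest_ind` with `equal_var=False`) on synthetic data. The null hypothesis is that mean sales are equal on promotion and non-promotion days. All numbers are generated, not taken from the Rossmann data, and the 0.05 significance level is an assumption.

```python
import numpy as np
from scipy import stats

# Synthetic daily sales for promotion and non-promotion days (not real data).
rng = np.random.default_rng(0)
promo_sales = rng.normal(loc=8000, scale=1500, size=200)
no_promo_sales = rng.normal(loc=6000, scale=1500, size=200)

# Welch's t-test does not assume the two groups share the same variance.
t_stat, p_value = stats.ttest_ind(promo_sales, no_promo_sales, equal_var=False)

alpha = 0.05  # assumed significance level
reject_null = p_value < alpha

print(f"t = {t_stat:.2f}, p = {p_value:.3g}, reject H0: {reject_null}")
```

On the real data, the two samples would be `Sales` filtered by `Promo == 1` and `Promo == 0`, ideally restricted to days when the store was open.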

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns

#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.

### 4. Textual Data Preprocessing 
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Expand Contraction

#### 2. Lower Casing

In [None]:
# Lower Casing

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations

#### 4. Removing URLs & Removing words containing digits.

In [None]:
# Remove URLs & remove words containing digits

#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords

In [None]:
# Remove White spaces

#### 6. Rephrase Text

In [None]:
# Rephrase Text

#### 7. Tokenization

In [None]:
# Tokenization

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)

##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging

#### 10. Text Vectorization

In [None]:
# Vectorizing Text

##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data

##### Which method have you used to scale your data and why?

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
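
Because the goal is to forecast future sales, one option is to split by date rather than at random, so the model is always evaluated on days later than those it was trained on. The sketch below is a minimal illustration on made-up rows; `time_based_split` and `test_days` are hypothetical names.

```python
import pandas as pd

# Made-up daily rows covering ten consecutive days.
toy = pd.DataFrame({
    "Date": pd.date_range("2015-01-01", periods=10, freq="D"),
    "Sales": range(10),
})

def time_based_split(frame, test_days):
    """Hold out the last `test_days` days as the test set."""
    frame = frame.sort_values("Date")
    # Everything after the cutoff date goes to the test set.
    cutoff = frame["Date"].max() - pd.Timedelta(days=test_days)
    train = frame[frame["Date"] <= cutoff]
    test = frame[frame["Date"] > cutoff]
    return train, test

train_df, test_df = time_based_split(toy, test_days=3)
print(len(train_df), len(test_df))
```

A random `train_test_split` is still reasonable for a first baseline, but it lets the model see days that fall after some of its test rows.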

##### What data splitting ratio have you used and why? 

Answer Here.

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm

# Predict on the model
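
A first model can follow the fit, predict, and score pattern outlined above. The sketch below fits a `LinearRegression` baseline on synthetic data (a single made-up feature standing in for customer counts) and reports MAE and RMSE. It shows the workflow only and makes no claim about performance on the Rossmann data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Synthetic data: sales roughly proportional to customer count, plus noise.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1000, size=(300, 1))
y = 8.0 * X[:, 0] + rng.normal(0, 200, size=300)

# Simple hold-out split: first 240 rows for training, last 60 for testing.
X_train, X_test = X[:240], X[240:]
y_train, y_test = y[:240], y[240:]

# Fit the algorithm
model = LinearRegression().fit(X_train, y_train)

# Predict on the held-out rows
pred = model.predict(X_test)

# MAE is the average absolute error; RMSE penalizes large errors more heavily.
mae = mean_absolute_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))
print(f"MAE = {mae:.1f}, RMSE = {rmse:.1f}")
```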

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model
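
One way to combine cross-validation with hyperparameter tuning is `GridSearchCV` with a `TimeSeriesSplit`, which always validates on rows that come after the training rows. The sketch below tunes `alpha` for a `Ridge` model on synthetic data; the grid values and the number of splits are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic regression data, assumed to be ordered in time.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

search = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
    cv=TimeSeriesSplit(n_splits=4),        # each fold validates on later rows
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes scores, hence "neg"
)

# Fit the algorithm for every alpha on every fold
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
best_rmse = -search.best_score_  # flip the sign back to a positive RMSE
print(best_alpha, round(best_rmse, 3))
```

After the search, `search.best_estimator_` is already refitted on all the data passed to `fit` and can be used directly for prediction.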

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm

# Predict on the model

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Answer Here.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Answer Here.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
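
Saving and reloading can be checked in one pass: persist the fitted model, load it back, and confirm that both copies give the same prediction. The sketch below uses `joblib` with a tiny made-up model and a temporary directory; the file name `sales_model.joblib` is a placeholder.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny made-up training set where y = 2 * x exactly.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "sales_model.joblib")  # placeholder name

    # Save the fitted model to disk
    joblib.dump(model, model_path)

    # Load it back as a separate object
    reloaded = joblib.load(model_path)

# Sanity check: both objects should predict the same value for unseen input.
unseen = np.array([[5.0]])
same_prediction = np.allclose(model.predict(unseen), reloaded.predict(unseen))
print(same_prediction)
```

In a real deployment the file would be written to a persistent location, and the scikit-learn version used for loading should match the one used for saving.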

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***