### Project Title: Sales Prediction Using Python
#### Done By: Nozipho Sithembiso Ndebele
---

<div style="text-align: center;">
<img src="https://assets.website-files.com/60e7f71b22c6d0b9cf329ceb/621e1a2f28ded71ee95aeede_6ProvenSalesForecastingMethodstoDriveRevenue1_a117440b5ae227c3dba5264a6521da06_2000.png" alt="Sales Image" width="1000"/>
</div>

---

## Table of Contents

<a href=#BC> Background Context</a>

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Data Collection and Description</a>

<a href=#three>3. Loading Data </a>

<a href=#four>4. Data Cleaning and Filtering</a>

<a href=#five>5. Exploratory Data Analysis (EDA)</a>

<a href=#six>6. Modeling </a>

<a href=#seven>7. Evaluation and Validation</a>

<a href=#eight>8. Final Model</a>

<a href=#nine>9. Conclusion and Future Work</a>

<a href=#ten>10. References</a>

---
 <a id="BC"></a>
## **Background Context**
Advertising is a critical component of a company’s marketing strategy. Understanding the impact of different advertising channels on product sales allows businesses to allocate their marketing budgets more effectively. In this project, we analyze the relationship between TV advertising and sales using a simple linear regression model.

### Purpose
This project aims to build a simple linear regression model to predict product sales based on TV advertising spending. By quantifying the relationship between advertising investment and resulting sales, we aim to offer data-driven insights that can support decision-making in marketing strategies.

### Significance
Analyzing the relationship between advertising and sales is beneficial for:

- Understanding the effectiveness of TV advertising as a marketing channel.

- Providing a baseline for predictive modeling and linear regression in business analytics.

- Demonstrating how simple linear regression can be used to interpret real-world relationships.

- Helping marketers and business analysts optimize advertising budgets.

This analysis serves as a practical example of applying linear regression to real-world business data.

### Problem Domain
The dataset includes advertising expenditures across three media types—TV, Radio, and Newspaper—and corresponding product sales. For this project, we focus on TV advertising as the sole independent variable to predict Sales using simple linear regression.

We aim to determine:

- Whether there is a statistically significant linear relationship between TV ad spend and Sales.

- How much of the variation in Sales can be explained by the TV advertising budget.
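
The two questions above correspond to standard quantities in simple linear regression. As a reference sketch (the symbols $\beta_0$, $\beta_1$, $\varepsilon_i$, $\hat{y}_i$, and $\bar{y}$ are notation introduced here, not taken from the notebook), with $y_i$ denoting Sales in market $i$:

```latex
\begin{aligned}
y_i &= \beta_0 + \beta_1\,\text{TV}_i + \varepsilon_i, \qquad i = 1, \dots, n \\
\hat{y}_i &= \hat{\beta}_0 + \hat{\beta}_1\,\text{TV}_i \\
R^2 &= 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
\end{aligned}
```

A test of $H_0: \beta_1 = 0$ addresses the first question. $R^2$, the share of the variation in Sales accounted for by the fitted line, addresses the second.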

### Challenges
- Linearity Assumption: Ensuring a linear relationship between TV advertising and sales.

- Residual Analysis: Verifying the assumptions of linear regression (e.g., normality, homoscedasticity).

- Feature Limitation: Only using TV as the predictor while ignoring other potentially relevant variables.

- Overfitting/Underfitting: Balancing model complexity and interpretability.
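
The residual checks listed above can be approximated with a few lines of NumPy. The following is a minimal sketch under stated assumptions: the data is synthetic (it is not the advertising dataset), the variable names are illustrative, and the two checks are crude numeric stand-ins for the residual plots or formal tests a full analysis would use.

```python
import numpy as np

# Synthetic stand-in data (NOT the advertising dataset), generated with constant noise.
rng = np.random.default_rng(seed=1)
tv = rng.uniform(0, 300, size=200)
sales = 7.0 + 0.05 * tv + rng.normal(0, 2.0, size=200)

# Fit a straight line and compute residuals (observed minus fitted).
slope, intercept = np.polyfit(tv, sales, deg=1)
residuals = sales - (intercept + slope * tv)

# With an intercept in the model, least-squares residuals average to (numerically) zero.
mean_residual = residuals.mean()

# Crude homoscedasticity check: compare residual spread for high vs. low TV spend.
low_spend = tv <= np.median(tv)
spread_ratio = residuals[~low_spend].std() / residuals[low_spend].std()

# Crude normality check: sample skewness should be near 0 for symmetric residuals.
standardized = (residuals - mean_residual) / residuals.std()
skewness = np.mean(standardized ** 3)

print(f"mean residual: {mean_residual:.2e}")
print(f"spread ratio (high/low spend): {spread_ratio:.2f}")
print(f"skewness: {skewness:.2f}")
```

A spread ratio far from 1 would suggest that the residual variance changes with TV spend, and a skewness far from 0 would suggest asymmetric residuals.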

### Key Questions
- Is there a strong linear relationship between TV advertising and sales?

- What is the slope of the regression line, and how should it be interpreted?

- How much of the variance in sales is explained by TV ad spend?

- Can we reliably use TV advertising as a predictor of future sales?
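
To make the slope and explained-variance questions concrete, the sketch below computes both from the closed-form least-squares formulas. It assumes synthetic data with a known underlying line (not the advertising dataset), so the numbers it prints say nothing about the real relationship between TV spend and sales.

```python
import numpy as np

# Synthetic stand-in data (NOT the advertising dataset).
rng = np.random.default_rng(seed=0)
tv = rng.uniform(0, 300, size=200)                       # hypothetical TV budgets
sales = 7.0 + 0.05 * tv + rng.normal(0, 2.0, size=200)   # hypothetical line: 7 + 0.05 * TV

# Closed-form least-squares estimates for a single predictor.
tv_mean, sales_mean = tv.mean(), sales.mean()
slope = np.sum((tv - tv_mean) * (sales - sales_mean)) / np.sum((tv - tv_mean) ** 2)
intercept = sales_mean - slope * tv_mean

# R^2 = 1 - (residual sum of squares) / (total sum of squares).
fitted = intercept + slope * tv
ss_res = np.sum((sales - fitted) ** 2)
ss_tot = np.sum((sales - sales_mean) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"slope={slope:.4f}, intercept={intercept:.4f}, R^2={r_squared:.3f}")
```

With the units used in this dataset, the slope would be read as the expected change in sales (thousands of units) for each additional thousand dollars of TV spend.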

---
<a id="one"></a>
## **Importing Packages**

### Purpose
To set up the Python environment with the necessary libraries for data manipulation, visualization, and machine learning. These libraries will facilitate data preprocessing, feature extraction, model training, and evaluation.

### Details
* Pandas: For handling and analyzing data.

* NumPy: For numerical operations.

* Matplotlib/Seaborn: For data visualization to understand trends and patterns.

* scikit-learn: For building and evaluating machine learning models.

* Plotly: For interactive plots.

---

In [14]:
# Import necessary packages  

# Data manipulation and analysis  
import pandas as pd  # Pandas for data handling  
import numpy as np  # NumPy for numerical operations  

# Data visualization  
import matplotlib.pyplot as plt  # Matplotlib for static plots  
import seaborn as sns  # Seaborn for statistical visualization  
import plotly.express as px  # Plotly for interactive plots  

# Modeling and evaluation
from sklearn.model_selection import train_test_split  # Train/test splitting
from sklearn.linear_model import LinearRegression  # Simple linear regression model
from sklearn.metrics import mean_squared_error, r2_score  # Evaluation metrics

# Configure visualization settings
sns.set(style='whitegrid')  # Set the default style for Seaborn plots
plt.rcParams['figure.figsize'] = (10, 6)  # Set default figure size for Matplotlib

# Suppress warnings
import warnings  # Import the warnings module
warnings.filterwarnings('ignore')  # Ignore all warning messages


---
<a id="two"></a>
## **Data Collection and Description**
### Purpose
This section summarizes the structure and scope of the advertising dataset. Understanding the dataset is crucial for building an effective linear regression model and drawing valid business insights.

### Details
The dataset originates from the book "An Introduction to Statistical Learning" and captures advertising expenditure and product sales data across various markets.

- Source: Advertising dataset from ISLR (Introduction to Statistical Learning).

- Format: CSV file with 200 rows and 4 columns.

- Scope: Data includes spending on TV, Radio, and Newspaper advertising, and corresponding sales values.

### Types of Data
- `TV`: Budget spent on TV advertising (in thousands of dollars)
- `Radio`: Budget spent on radio advertising (in thousands of dollars)
- `Newspaper`: Budget spent on newspaper advertising (in thousands of dollars)
- `Sales`: Product sales (in thousands of units)
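
A quick structural check can confirm that a loaded frame matches this description. The sketch below is an illustration only: `validate_schema` and `EXPECTED_COLUMNS` are hypothetical names introduced here, and the sample rows are the five rows shown in this notebook's `df_copy.head()` output rather than the full dataset.

```python
import pandas as pd

# Column names taken from the description above.
EXPECTED_COLUMNS = ["TV", "Radio", "Newspaper", "Sales"]


def validate_schema(df: pd.DataFrame) -> list:
    """Return a list of problems; an empty list means the frame looks as described."""
    problems = []
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    for col in EXPECTED_COLUMNS:
        if col in df.columns and not pd.api.types.is_numeric_dtype(df[col]):
            problems.append(f"column {col!r} is not numeric")
    return problems


# Five sample rows only; the real dataset has 200 rows.
sample = pd.DataFrame({
    "TV": [230.1, 44.5, 17.2, 151.5, 180.8],
    "Radio": [37.8, 39.3, 45.9, 41.3, 10.8],
    "Newspaper": [69.2, 45.1, 69.3, 58.5, 58.4],
    "Sales": [22.1, 10.4, 12.0, 16.5, 17.9],
})

print(validate_schema(sample))
print(validate_schema(sample.drop(columns=["Sales"])))
```

Returning a list of problems instead of raising on the first one lets every schema issue be reported in a single pass.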

---
<a id="three"></a>
## **Loading Data**
### Purpose
The purpose of this section is to load the dataset into the notebook for further manipulation and analysis. This is the first step in working with the data, and it allows us to inspect the raw data and get a sense of its structure.

### Details
In this section, we will load the dataset into a Pandas DataFrame and display the first few rows to understand what the raw data looks like. This will help in planning the next steps of data cleaning and analysis.


---

In [15]:
# Load the dataset into a Pandas DataFrame

# The dataset is stored in a CSV file named 'advertising.csv'
df = pd.read_csv('advertising.csv')

In [16]:
# df is the original dataset (DataFrame), this creates a copy of it
df_copy = df.copy()

# Now 'df_copy' is an independent copy of 'df'. Changes to 'df_copy' won't affect 'df'.


In [17]:
# Display the first few rows of the dataset to get a sense of what the raw data looks like
df_copy.head()

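|   | TV | Radio | Newspaper | Sales |
|---|-------|-------|-----------|-------|
| 0 | 230.1 | 37.8 | 69.2 | 22.1 |
| 1 | 44.5 | 39.3 | 45.1 | 10.4 |
| 2 | 17.2 | 45.9 | 69.3 | 12.0 |
| 3 | 151.5 | 41.3 | 58.5 | 16.5 |
| 4 | 180.8 | 10.8 | 58.4 | 17.9 |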


In [18]:
# Display the number of rows and columns in the dataset to understand its size
df_copy.shape

(200, 4)

In [19]:
# Check the structure of the dataset
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   TV         200 non-null    float64
 1   Radio      200 non-null    float64
 2   Newspaper  200 non-null    float64
 3   Sales      200 non-null    float64
dtypes: float64(4)
memory usage: 6.4 KB


---
<a id="four"></a>
## **Data Cleaning and Filtering**
Before analyzing the data, it is crucial to clean and filter it. This process involves handling missing values, removing outliers, correcting errors, and possibly reducing the data by filtering out irrelevant features. These steps ensure that the analysis is based on accurate and reliable data.

### Details
In this section, we will:

* Check for Missing Values: Identify if there are any missing values in the dataset and handle them accordingly.
* Remove Duplicates: Ensure there are no duplicate rows that could skew the analysis.
* Correct Errors: Look for and correct any obvious data entry errors.
* Filter Data: Depending on the analysis requirements, filter the data to include only relevant records.
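
The last two steps in this list (correcting errors and filtering) could look like the following sketch. It rests on assumptions not stated in the notebook: the validity rule (budgets and sales cannot be negative) is chosen for illustration, and the third row of the hypothetical `raw` frame is deliberately corrupted to show the check firing.

```python
import pandas as pd

# Hypothetical frame; the third row has a deliberately invalid negative TV budget.
raw = pd.DataFrame({
    "TV": [230.1, 44.5, -17.2],
    "Radio": [37.8, 39.3, 45.9],
    "Newspaper": [69.2, 45.1, 69.3],
    "Sales": [22.1, 10.4, 12.0],
})

# Assumed validity rule: budgets and sales cannot be negative.
is_valid = (raw >= 0).all(axis=1)
print(f"Rows failing the non-negative check: {(~is_valid).sum()}")

# Keep only valid rows and the two columns used by the simple regression.
model_data = raw.loc[is_valid, ["TV", "Sales"]].reset_index(drop=True)
print(model_data)
```

Flagging invalid rows with a boolean mask before dropping them keeps the count of affected rows visible, which is useful for reporting what the cleaning step changed.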

In [20]:
# 1. Check for missing values in the dataset

def check_missing_values(df):
    """
    Check for missing values in the dataset and display the number of missing values per column.

    Parameters:
    df (pandas.DataFrame): The dataset to check for missing values.

    Returns:
    pandas.Series: A series showing the number of missing values for each column.
    """
    # Check for missing values in the dataset and display them
    print("Missing values per column:")
    missing_values = df.isnull().sum()
    print(missing_values)
    return missing_values


In [21]:
# Run the missing-value check on the working copy of the dataset
missing_values = check_missing_values(df_copy)


Missing values per column:
TV           0
Radio        0
Newspaper    0
Sales        0
dtype: int64


After examining the dataset, no missing values were found across any of the columns. This ensures data completeness and eliminates the need for imputation or further cleaning related to missing data.


In [22]:
def remove_duplicates(df):
    """
    Checks for duplicate rows in the dataset and removes them if any are found.

    Args:
    df (pandas.DataFrame): The dataframe to check for duplicate rows.

    Returns:
    pandas.DataFrame: The dataframe with duplicate rows removed, if any existed.
    """
    # Check for duplicate rows
    duplicate_rows = df.duplicated().sum()
    print(f"\nNumber of duplicate rows: {duplicate_rows}")
    
    # Remove duplicates if any exist
    if duplicate_rows > 0:
        df.drop_duplicates(inplace=True)
        print(f"Duplicate rows removed. Updated dataframe has {len(df)} rows.")
    else:
        print("No duplicate rows found.")
    
    return df

In [23]:
df_copy = remove_duplicates(df_copy)


Number of duplicate rows: 0
No duplicate rows found.


Upon reviewing the dataset, no duplicate rows were found. This ensures that all records are unique, and no further action is required for data deduplication.


## **Saving the Cleaned Dataset**
### Purpose

This section outlines how to save the cleaned dataset for future use. Saving the dataset ensures that the data cleaning process does not need to be repeated and allows for consistent use in subsequent analyses.

### Details

We will save the cleaned dataset as a CSV file.

In [24]:
# Save the cleaned dataset to a new CSV file

def save_cleaned_dataset(df, filename='cleaned_advertising.csv'):
    """
    Saves the cleaned dataframe to a CSV file.

    Args:
    df (pandas.DataFrame): The cleaned dataframe to save.
    filename (str): The name of the file to save the dataframe to (default is 'cleaned_advertising.csv').

    Returns:
    None
    """
    # Save the cleaned dataset to a CSV file
    df.to_csv(filename, index=False)
    print(f"Cleaned dataset saved successfully as '{filename}'.")


In [25]:
save_cleaned_dataset(df_copy)


Cleaned dataset saved successfully as 'cleaned_advertising.csv'.
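
One way to confirm that a saved file can be reused consistently is to reload it and compare it with the in-memory frame. The sketch below assumes a small stand-in frame and a temporary directory so that it runs on its own; in the notebook, the frame would be `df_copy` and the path would be the saved file.

```python
import os
import tempfile

import pandas as pd

# Small stand-in frame; in the notebook this would be `df_copy`.
cleaned = pd.DataFrame({"TV": [230.1, 44.5], "Sales": [22.1, 10.4]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_advertising.csv")
    cleaned.to_csv(path, index=False)  # same call that save_cleaned_dataset makes
    reloaded = pd.read_csv(path)

# Raises AssertionError if columns, dtypes, or values (within floating-point tolerance) differ.
pd.testing.assert_frame_equal(cleaned, reloaded)
print("Round trip OK")
```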


---
<a id="five"></a>
## **Exploratory Data Analysis (EDA)**

Exploratory data analysis (EDA) is the process of analyzing a dataset to summarize its key features, often through visualization. It aims to discover patterns, spot anomalies, and formulate hypotheses that inform subsequent analysis.
#### Advantages

- Helps in understanding the data before modeling.
- Provides insights that guide feature selection and engineering.
- Assists in choosing appropriate modeling techniques.
- Uncovers potential data quality issues early.
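
As a minimal sketch of a first numeric look at the data, summary statistics and correlations with `Sales` can be computed in pandas. The frame below holds only the five rows displayed by `df_copy.head()` earlier in this notebook, which is far too few to draw conclusions from. In the notebook, the same calls would be applied to `df_copy`.

```python
import pandas as pd

# The five rows shown by `df_copy.head()`; illustrative only.
sample = pd.DataFrame({
    "TV": [230.1, 44.5, 17.2, 151.5, 180.8],
    "Radio": [37.8, 39.3, 45.9, 41.3, 10.8],
    "Newspaper": [69.2, 45.1, 69.3, 58.5, 58.4],
    "Sales": [22.1, 10.4, 12.0, 16.5, 17.9],
})

# Count, mean, standard deviation, min/max, and quartiles for each column.
summary = sample.describe()
print(summary)

# Pearson correlation of every column with Sales, strongest positive first.
correlations = sample.corr()["Sales"].sort_values(ascending=False)
print(correlations)
```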

`The following methods were employed to communicate our objective:`



---
<a id="nine"></a>
## **Conclusion and Future Work**


##### Conclusion



##### Future Work

To build upon this study, future work could focus on the following areas:



---
<a id="ten"></a>
## **References**

## Additional Sections to Consider

**Contributors**: Nozipho Sithembiso Ndebele
