# Data Ingestion

In [1]:
# Import the required libraries
import pandas as pd

# Loading dataset

In [2]:
# Specify the dataset path (relative to this notebook)
dataset_path = "../dataset/ads.csv"

In [3]:
# Load the dataset into a DataFrame for analysis
df = pd.read_csv(dataset_path)

# About Dataset
This dataset appears to contain data related to advertising expenditures across different mediums (TV, radio, newspaper) and their corresponding sales figures. Here's a breakdown of the columns:

1. **Unnamed: 0**: This column seems to be an index or identifier for each entry in the dataset.
2. **TV**: This column likely represents the advertising expenditure on TV for each entry.
3. **radio**: This column likely represents the advertising expenditure on radio for each entry.
4. **newspaper**: This column likely represents the advertising expenditure on newspapers for each entry.
5. **sales**: This column represents the sales figures corresponding to each set of advertising expenditures.

Each row in the dataset seems to represent a particular instance where advertising expenditures on TV, radio, and newspapers were made, and the resulting sales figure. 

The goal of this dataset might be to analyze the impact of advertising expenditures across different mediums on sales. It could be used for regression analysis or predictive modeling to understand how changes in advertising spending influence sales.
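Because `Unnamed: 0` appears to only repeat the row number, one option is to read it as the index at load time so that it stays out of later statistics. A minimal sketch, assuming the first CSV column is that unnamed identifier; the inline `csv_text` is a hypothetical stand-in built from the first three rows of the dataset, not the real file:

```python
import io
import pandas as pd

# Hypothetical stand-in for ads.csv: an unnamed first column holding the row id.
csv_text = """,TV,radio,newspaper,sales
1,230.1,37.8,69.2,22.1
2,44.5,39.3,45.1,10.4
3,17.2,45.9,69.3,9.3
"""

# Default read: the unnamed column is loaded as a regular column "Unnamed: 0".
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses that column as the row index instead of as a feature.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(indexed.columns))
```

The rest of this notebook keeps the default read, so `Unnamed: 0` still shows up in every output.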

# Data Profiling and Inspection

## 1. Data Size

In [4]:
df.shape

(200, 5)

* The dataset is two-dimensional, with 200 rows and 5 columns.

## 2. Data Preview

In [5]:
# To preview the data, use the .head() or .sample() method
df.head(3)  # return the first 3 rows of the dataset

,Unnamed: 0,TV,radio,newspaper,sales
0,1,230.1,37.8,69.2,22.1
1,2,44.5,39.3,45.1,10.4
2,3,17.2,45.9,69.3,9.3


In [6]:
df.sample(3)  # return 3 randomly selected rows of the dataset

,Unnamed: 0,TV,radio,newspaper,sales
83,84,68.4,44.5,35.6,13.6
154,155,187.8,21.1,9.5,15.6
186,187,139.5,2.1,26.6,10.3


## 3. Data Types

In [7]:
# To view the data types in the dataset, use .dtypes or .info()
df.dtypes  # return only the data type of each column

Unnamed: 0      int64
TV            float64
radio         float64
newspaper     float64
sales         float64
dtype: object

In [8]:
df.info()  # print a summary of the dataset, including the data type and non-null count of each column

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  200 non-null    int64  
 1   TV          200 non-null    float64
 2   radio       200 non-null    float64
 3   newspaper   200 non-null    float64
 4   sales       200 non-null    float64
dtypes: float64(4), int64(1)
memory usage: 7.9 KB


### Based on the information above, we can conclude that:
* The total number of rows (entries) is 200.
* The total number of columns is 5.
* One column (`Unnamed: 0`) is an integer type and the other four are float types.
* TV, radio, newspaper and sales are all `float64`.
* Every column has 200 non-null entries, so there are no missing values.

## 4. Missing Values

In [9]:
# To check for missing values, use the .isnull() or .isna() method
df.isna().sum()

Unnamed: 0    0
TV            0
radio         0
newspaper     0
sales         0
dtype: int64

#### As we can see, there are no missing values in this dataset.
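When a dataset does contain gaps, the share of missing entries per column is often easier to read than the raw count. A small sketch on a toy frame with hypothetical values and one deliberately missing entry (the real dataset has none):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one deliberately missing TV entry.
toy = pd.DataFrame({
    "TV": [230.1, 44.5, np.nan, 151.5],
    "sales": [22.1, 10.4, 9.3, 18.5],
})

missing_count = toy.isna().sum()          # number of NaNs per column
missing_share = toy.isna().mean() * 100   # percentage of NaNs per column

print(missing_count)
print(missing_share)
```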

## 5. Statistical Overview

In [10]:
# To get summary statistics, use .describe()
df.describe().transpose()

,count,mean,std,min,25%,50%,75%,max
Unnamed: 0,200.0,100.5,57.879185,1.0,50.75,100.5,150.25,200.0
TV,200.0,147.0425,85.854236,0.7,74.375,149.75,218.825,296.4
radio,200.0,23.264,14.846809,0.0,9.975,22.9,36.525,49.6
newspaper,200.0,30.554,21.778621,0.3,12.75,25.75,45.1,114.0
sales,200.0,14.0225,5.217457,1.6,10.375,12.9,17.4,27.0


#### Based on the statistical overview above, we can conclude that:
* Every column has the same count (200).
* TV:
    1. Advertising spend ranges from a minimum of 0.7 to a maximum of 296.4.
    2. The mean is 147.0425 and the standard deviation is 85.854236.
* radio:
    1. Advertising spend ranges from a minimum of 0 to a maximum of 49.6.
    2. The mean is 23.2640 and the standard deviation is 14.846809.
* newspaper:
    1. Advertising spend ranges from a minimum of 0.3 to a maximum of 114.0.
    2. The mean is 30.5540 and the standard deviation is 21.778621.
* sales:
    1. Sales range from a minimum of 1.6 to a maximum of 27.0.
    2. The mean is 14.0225 and the standard deviation is 5.217457.

## 6. Data Skewness

In [11]:
# To check the skewness of numerical features, use the .skew() method
df.skew()

Unnamed: 0    0.000000
TV           -0.069853
radio         0.094175
newspaper     0.894720
sales         0.407571
dtype: float64

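#### Based on the skewness values above, we can conclude that:
* TV (-0.07) and radio (0.09) are close to symmetric; the slight negative skew of TV is consistent with its median (149.75) sitting just above its mean (147.0425).
* newspaper (0.89) is clearly right-skewed, and sales (0.41) is mildly right-skewed.
* `Unnamed: 0` has zero skew because it is an evenly spaced row identifier.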
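The numbers returned by `.skew()` can be reproduced by hand. To my understanding, pandas reports the bias-adjusted (Fisher-Pearson) sample skewness; the sketch below implements that formula and compares it with pandas. The helper name `sample_skewness` is hypothetical, and the input is only the six TV values visible in the data previews, not the full column, so the result will not match the -0.069853 reported for all 200 rows.

```python
import pandas as pd

def sample_skewness(values):
    """Bias-adjusted Fisher-Pearson skewness, computed step by step."""
    n = len(values)
    mean = sum(values) / n
    # Sample standard deviation (divides by n - 1).
    s = (sum((v - mean) ** 2 for v in values) / (n - 1)) ** 0.5
    # Sum of cubed standardized deviations.
    cubed = sum(((v - mean) / s) ** 3 for v in values)
    return n / ((n - 1) * (n - 2)) * cubed

# TV values from the six rows shown in the previews (not the full column).
tv = [230.1, 44.5, 17.2, 68.4, 187.8, 139.5]

by_hand = sample_skewness(tv)
from_pandas = pd.Series(tv).skew()
print(by_hand, from_pandas)
```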

## 7. Correlation Analysis

In [12]:
# To compute the pairwise correlation of numerical features, use the .corr() method
df.corr()

,Unnamed: 0,TV,radio,newspaper,sales
Unnamed: 0,1.0,0.017715,-0.11068,-0.154944,-0.051616
TV,0.017715,1.0,0.054809,0.056648,0.782224
radio,-0.11068,0.054809,1.0,0.354104,0.576223
newspaper,-0.154944,0.056648,0.354104,1.0,0.228299
sales,-0.051616,0.782224,0.576223,0.228299,1.0


#### From the correlation matrix above, we can deduce that:
* TV has a strong positive correlation with sales (0.782224), radio a moderate one (0.576223), and newspaper only a weak one (0.228299).
* `Unnamed: 0` is essentially uncorrelated with sales (-0.051616), as expected for a row identifier.
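To read the relationship with the target directly, the `sales` column of the correlation matrix can be pulled out and ranked by absolute value. A sketch on the six rows visible in the data previews; that is far too few rows for stable estimates, so the printed values will differ from the full-dataset matrix and serve only to show the mechanics:

```python
import pandas as pd

# The six rows shown in the previews (identifier column left out).
rows = pd.DataFrame({
    "TV": [230.1, 44.5, 17.2, 68.4, 187.8, 139.5],
    "radio": [37.8, 39.3, 45.9, 44.5, 21.1, 2.1],
    "newspaper": [69.2, 45.1, 69.3, 35.6, 9.5, 26.6],
    "sales": [22.1, 10.4, 9.3, 13.6, 15.6, 10.3],
})

# Pearson correlation of every feature with sales, strongest (by magnitude) first.
corr_with_sales = (
    rows.corr()["sales"]
    .drop("sales")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr_with_sales)
```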

## 8. Data Duplication

In [13]:
# To count duplicated rows, use .duplicated()
df.duplicated().sum()

0

#### This indicates that there are no fully duplicated rows in the dataset.
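One caveat: `.duplicated()` compares entire rows, and `Unnamed: 0` is different on every row, so the full-row check cannot flag two entries that share the same spend and sales values. A sketch of a check restricted to the measurement columns, using a toy frame with hypothetical repeated values (the list `feature_cols` is also an assumption about which columns matter):

```python
import pandas as pd

# Toy frame (hypothetical): rows 1 and 3 share identical spend and sales values.
toy = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3],
    "TV": [230.1, 44.5, 230.1],
    "radio": [37.8, 39.3, 37.8],
    "newspaper": [69.2, 45.1, 69.2],
    "sales": [22.1, 10.4, 22.1],
})

# Full-row check: the unique identifier makes every row look distinct.
full_row_dupes = toy.duplicated().sum()

# Check only the measurement columns, ignoring the identifier.
feature_cols = ["TV", "radio", "newspaper", "sales"]
feature_dupes = toy.duplicated(subset=feature_cols).sum()

print(full_row_dupes, feature_dupes)
```

Running the same `subset` check on `df` would show whether the result of 0 above also holds once the identifier is ignored.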