# **Experiment No. 3**
# *Data Preprocessing, reading the dataset, handling missing data, conversion to tensor format*

## Data Preprocessing
> Data preprocessing is a crucial step in preparing data for analysis or machine learning. It involves tasks like reading datasets, handling missing data, and converting data to a suitable format, often tensors, for efficient computation. Here's how you can approach these tasks:

### Reading the dataset
> Reading a dataset is the first step. The choice of library depends on the dataset format (CSV, Excel, etc.). The pandas library is commonly used for reading structured data.

In [1]:
import pandas as pd
import numpy as np

data=pd.read_csv('/content/exp2dataset.csv')
data.head(5)

   R&D Spend  Administration  Marketing Spend     Profit
0  165349.20       136897.80        471784.10  192261.83
1  162597.70       151377.59        443898.53        NaN
2        NaN       101145.55        407934.54  191050.39
3  144372.41       118671.85        383199.62  182901.99
4  142107.34        91391.77              NaN  166187.94
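
> The sketch below shows the same reading step on a small in-memory sample, so it runs without the original file. The `csv_text` string is an assumption: it copies only the five rows displayed above, with empty fields standing in for the missing cells.

```python
import io

import pandas as pd

# Stand-in for the CSV file: the five rows displayed above, with empty fields
# where values are missing. This is illustrative data, not the full dataset.
csv_text = """R&D Spend,Administration,Marketing Spend,Profit
165349.2,136897.8,471784.1,192261.83
162597.7,151377.59,443898.53,
,101145.55,407934.54,191050.39
144372.41,118671.85,383199.62,182901.99
142107.34,91391.77,,166187.94
"""

# read_csv accepts any file-like object, so StringIO can replace a path.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (number of rows, number of columns)
print(sample.dtypes)   # empty fields are parsed as NaN, so columns are floats
```

> Checking `shape` and `dtypes` right after loading is a cheap way to confirm that the file was parsed as expected before any cleaning starts.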


### Handling Null values / missing data
> Handling null (missing) values is a critical part of data preprocessing, because missing data can degrade the quality of an analysis or a machine learning model. The pandas library provides several methods for detecting, removing, and filling null values. The common techniques are demonstrated in the cells that follow.

In [2]:
# Detecting null values

print(data.isnull().sum()) #gives total null values per column

R&D Spend          3
Administration     3
Marketing Spend    2
Profit             2
dtype: int64
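
> Beyond per-column totals, it often helps to see the share of missing values and the affected rows. The following is a minimal sketch on a hypothetical frame `toy` (the column names and values are made up for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the experiment's dataset.
toy = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, np.nan],
    "B": [np.nan, 2.0, 3.0, 4.0],
})

# The mean of a boolean mask is the fraction of True values,
# i.e. the share of missing entries in each column.
missing_fraction = toy.isnull().mean()

# Keep only the rows that contain at least one missing value.
rows_with_nulls = toy[toy.isnull().any(axis=1)]

print(missing_fraction)
print(rows_with_nulls)
```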


In [3]:
# Removing null values

df=data.dropna()  #deletes rows with null value
print(df.isnull().sum())

R&D Spend          0
Administration     0
Marketing Spend    0
Profit             0
dtype: int64
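
> Dropping every row that has any null value can discard a lot of data. As a sketch of the finer-grained options of `dropna`, the example below uses a hypothetical frame `toy`; the `subset`, `thresh`, and `how` parameters are part of pandas, while the data is invented.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the experiment's dataset.
toy = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, np.nan],
    "B": [np.nan, 2.0, 3.0, np.nan],
    "C": [1.0, 2.0, 3.0, 4.0],
})

drop_any = toy.dropna()                  # drop rows with at least one null
drop_on_a = toy.dropna(subset=["A"])     # only look at column A
keep_two = toy.dropna(thresh=2)          # keep rows with >= 2 non-null values
drop_empty = toy.dropna(how="all")       # drop rows where every value is null

print(len(drop_any), len(drop_on_a), len(keep_two), len(drop_empty))
```

> `thresh` is a useful middle ground: it keeps rows that are mostly complete and leaves the remaining gaps for imputation.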


In [4]:
# Imputing Null values with mean
imputed_data=data.fillna(data.mean())
print(imputed_data)


        R&D Spend  Administration  Marketing Spend     Profit
0   165349.200000   136897.800000    471784.100000  192261.83
1   162597.700000   151377.590000    443898.530000  149436.42
2   117116.659412   101145.550000    407934.540000  191050.39
3   144372.410000   118671.850000    383199.620000  182901.99
4   142107.340000    91391.770000    304411.417778  166187.94
5   131876.900000    99814.710000    362861.360000  156991.12
6   134615.460000   123552.015294    127716.820000  156122.51
7   130298.130000   145530.060000    323876.680000  155752.60
8   120542.520000   148718.950000    311613.290000  152211.77
9   117116.659412   108679.170000    304981.620000  149759.96
10  101913.080000   110594.110000    229160.950000  146121.95
11  100671.960000    91790.610000    249744.550000  144259.40
12   93863.750000   127320.380000    249839.440000  149436.42
13   91992.390000   123552.015294    252664.930000  134307.35
14  119943.240000   156547.420000    256512.920000  132602.65
15  1171
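
> The mean is sensitive to outliers, so the median is a common alternative for imputation. The sketch below compares the two on a hypothetical frame `toy`; passing `numeric_only=True` is an added precaution (not in the cell above) so that text columns are skipped rather than causing an error.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one outlier (100.0) and one text column.
toy = pd.DataFrame({
    "spend": [1.0, 2.0, np.nan, 100.0],
    "region": ["N", "S", "E", "W"],
})

# numeric_only=True restricts the statistic to numeric columns.
mean_filled = toy.fillna(toy.mean(numeric_only=True))
median_filled = toy.fillna(toy.median(numeric_only=True))

print(mean_filled.loc[2, "spend"])     # pulled upward by the outlier
print(median_filled.loc[2, "spend"])   # stays near the typical values
```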

In [5]:
# Imputing Null values with specific value
imputed_data=data.fillna(0)
print(imputed_data)

    R&D Spend  Administration  Marketing Spend     Profit
0   165349.20       136897.80        471784.10  192261.83
1   162597.70       151377.59        443898.53       0.00
2        0.00       101145.55        407934.54  191050.39
3   144372.41       118671.85        383199.62  182901.99
4   142107.34        91391.77             0.00  166187.94
5   131876.90        99814.71        362861.36  156991.12
6   134615.46            0.00        127716.82  156122.51
7   130298.13       145530.06        323876.68  155752.60
8   120542.52       148718.95        311613.29  152211.77
9        0.00       108679.17        304981.62  149759.96
10  101913.08       110594.11        229160.95  146121.95
11  100671.96        91790.61        249744.55  144259.40
12   93863.75       127320.38        249839.44       0.00
13   91992.39            0.00        252664.93  134307.35
14  119943.24       156547.42        256512.92  132602.65
15       0.00       122616.84        261776.23  129917.04
16   78013.11 
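
> A single constant such as 0 is rarely appropriate for every column. `fillna` also accepts a dictionary that maps column names to fill values. The following sketch uses a hypothetical frame `toy` and arbitrary fill choices.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the experiment's dataset.
toy = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [np.nan, 4.0, 8.0],
})

# Choose a fill value per column: a constant for A, the column mean for B.
fill_values = {"A": 0.0, "B": toy["B"].mean()}
filled = toy.fillna(fill_values)

print(filled)
```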

In [6]:
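# Forward fill: copy the last valid value downward into each gap.
# DataFrame.ffill() replaces fillna(method='ffill'), which is deprecated since pandas 2.1.
data_ffill = data.ffill()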
print(data_ffill)

    R&D Spend  Administration  Marketing Spend     Profit
0   165349.20       136897.80        471784.10  192261.83
1   162597.70       151377.59        443898.53  192261.83
2   162597.70       101145.55        407934.54  191050.39
3   144372.41       118671.85        383199.62  182901.99
4   142107.34        91391.77        383199.62  166187.94
5   131876.90        99814.71        362861.36  156991.12
6   134615.46        99814.71        127716.82  156122.51
7   130298.13       145530.06        323876.68  155752.60
8   120542.52       148718.95        311613.29  152211.77
9   120542.52       108679.17        304981.62  149759.96
10  101913.08       110594.11        229160.95  146121.95
11  100671.96        91790.61        249744.55  144259.40
12   93863.75       127320.38        249839.44  144259.40
13   91992.39       127320.38        252664.93  134307.35
14  119943.24       156547.42        256512.92  132602.65
15  119943.24       122616.84        261776.23  129917.04
16   78013.11 

In [7]:
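# Backward fill: copy the next valid value upward into each gap.
# DataFrame.bfill() replaces fillna(method='bfill'), which is deprecated since pandas 2.1.
data_bfill = data.bfill()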
print(data_bfill)

    R&D Spend  Administration  Marketing Spend     Profit
0   165349.20       136897.80        471784.10  192261.83
1   162597.70       151377.59        443898.53  191050.39
2   144372.41       101145.55        407934.54  191050.39
3   144372.41       118671.85        383199.62  182901.99
4   142107.34        91391.77        362861.36  166187.94
5   131876.90        99814.71        362861.36  156991.12
6   134615.46       145530.06        127716.82  156122.51
7   130298.13       145530.06        323876.68  155752.60
8   120542.52       148718.95        311613.29  152211.77
9   101913.08       108679.17        304981.62  149759.96
10  101913.08       110594.11        229160.95  146121.95
11  100671.96        91790.61        249744.55  144259.40
12   93863.75       127320.38        249839.44  134307.35
13   91992.39       156547.42        252664.93  134307.35
14  119943.24       156547.42        256512.92  132602.65
15   78013.11       122616.84        261776.23  129917.04
16   78013.11 
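
> Forward and backward fill behave differently at the edges of a column. The sketch below, on a hypothetical series `s`, shows that a leading null survives a forward fill, how `limit` caps the number of consecutive values filled, and how chaining the two fills closes every gap.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series with a leading gap and a two-value gap.
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])

forward = s.ffill()            # leading NaN has no earlier value, so it remains
backward = s.bfill()           # every gap takes the next valid value
capped = s.ffill(limit=1)      # fill at most one consecutive NaN per gap
both = s.ffill().bfill()       # forward fill, then back-fill what is left

print(forward.tolist())
print(backward.tolist())
print(capped.tolist())
print(both.tolist())
```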

In [13]:
# Creating a DataFrame with null values
data1 = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [6, np.nan, 8, np.nan, 10]
})
print(data1)

     A     B
0  1.0   6.0
1  2.0   NaN
2  NaN   8.0
3  4.0   NaN
4  5.0  10.0


In [14]:
# Interpolation

data_interpolated = data1.interpolate(method='linear')
print(data_interpolated)

     A     B
0  1.0   6.0
1  2.0   7.0
2  3.0   8.0
3  4.0   9.0
4  5.0  10.0
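
> Linear interpolation places each missing value on the straight line between its nearest known neighbours, treating the rows as equally spaced. As a sketch of that arithmetic, the example below rebuilds column `B` by hand with `np.interp` and compares it with the pandas result.

```python
import numpy as np
import pandas as pd

# Column B from the DataFrame above.
b = pd.Series([6.0, np.nan, 8.0, np.nan, 10.0])

values = b.to_numpy()
known = ~np.isnan(values)            # mask of positions with a real value
positions = np.arange(len(values))   # row positions 0, 1, 2, 3, 4

# Interpolate at every position from the known (position, value) pairs.
manual = np.interp(positions, positions[known], values[known])

# pandas' linear method should agree with the manual calculation.
via_pandas = b.interpolate(method="linear").to_numpy()

print(manual)
print(via_pandas)
```

> For row 1, the neighbours are 6 and 8, so the filled value is their midpoint, 7; row 3 is filled with 9 in the same way.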


### Conversion to Tensor Format
> Tensors are the fundamental data structure for numerical computation in machine learning and scientific computing. A tensor is a multi-dimensional array: a scalar has 0 dimensions, a vector 1, a matrix 2, and so on. The sections below show how a cleaned DataFrame is converted to tensors in two libraries.

#### NumPy Tensors
> In the NumPy library, tensors are represented as NumPy arrays (`ndarray`). These arrays can have any number of dimensions and support a wide range of mathematical and numerical operations.

In [15]:
numpy_tensor=data_interpolated.to_numpy()
print(numpy_tensor)

[[ 1.  6.]
 [ 2.  7.]
 [ 3.  8.]
 [ 4.  9.]
 [ 5. 10.]]
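
> Many machine learning libraries default to 32-bit floats, so it can be useful to set the dtype during conversion and to separate inputs from outputs. The sketch below rebuilds the interpolated frame as `df`; treating column `B` as the target is purely an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Same values as the interpolated DataFrame above.
df = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [6.0, 7.0, 8.0, 9.0, 10.0],
})

# Hypothetical split: column A as the feature matrix, column B as the target.
features = df[["A"]].to_numpy(dtype=np.float32)   # 2-D array, shape (5, 1)
target = df["B"].to_numpy(dtype=np.float32)       # 1-D array, shape (5,)

print(features.shape, features.dtype)
print(target.shape, target.dtype)
```

> Selecting with a list (`df[["A"]]`) keeps the feature array two-dimensional, which is the layout most model APIs expect.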


#### TensorFlow Tensors
> The TensorFlow library also uses tensors as its primary data structure for building and training machine learning models.

In [16]:
import tensorflow as tf

# Convert DataFrame to TensorFlow tensor
tf_tensor = tf.constant(data_interpolated.values)
print(tf_tensor)

tf.Tensor(
[[ 1.  6.]
 [ 2.  7.]
 [ 3.  8.]
 [ 4.  9.]
 [ 5. 10.]], shape=(5, 2), dtype=float64)
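
> The tensor above inherits `float64` from the DataFrame. A minimal sketch of producing a `float32` tensor instead, assuming TensorFlow 2.x with eager execution, is to cast during the NumPy conversion. The frame `df` repeats the interpolated values so that the example is self-contained.

```python
import pandas as pd
import tensorflow as tf

# Same values as the interpolated DataFrame above.
df = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [6.0, 7.0, 8.0, 9.0, 10.0],
})

# Cast to float32 in NumPy first; tf.constant keeps the array's dtype.
tf_tensor32 = tf.constant(df.to_numpy(dtype="float32"))

print(tf_tensor32.shape, tf_tensor32.dtype)

# In eager mode, .numpy() converts the tensor back to a NumPy array.
round_trip = tf_tensor32.numpy()
print(round_trip)
```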
