### Perform various data preprocessing techniques like handling missing data and feature scaling.

#### step 1: Start by importing the necessary Python libraries for data preprocessing.


In [9]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

#### Step 2: Load the placement dataset into a Pandas Dataframe.

In [17]:
df=pd.read_csv("Automobile.csv")
df.head()

Unnamed: 0,name,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin
0,chevrolet chevelle malibu,18.0,8.0,307.0,130.0,3504.0,12.0,70,usa
1,buick skylark 320,15.0,8.0,350.0,165.0,3693.0,11.5,70,usa
2,plymouth satellite,18.0,8.0,318.0,150.0,3436.0,11.0,70,usa
3,amc rebel sst,16.0,8.0,304.0,150.0,3433.0,12.0,70,usa
4,ford torino,17.0,,302.0,140.0,3449.0,10.5,70,usa


#### Step 3:Take a quick look at the data to understand its structure and identify any missing values or anomalies.

In [18]:
df.info()
df.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   name          398 non-null    object 
 1   mpg           398 non-null    float64
 2   cylinders     395 non-null    float64
 3   displacement  395 non-null    float64
 4   horsepower    386 non-null    float64
 5   weight        396 non-null    float64
 6   acceleration  395 non-null    float64
 7   model_year    398 non-null    int64  
 8   origin        398 non-null    object 
dtypes: float64(6), int64(1), object(2)
memory usage: 28.1+ KB


(398, 9)

#### The method isnull() checks each element in the DataFrame (or Series) to see if it is NaN (Not a Number) or None (missing value).
It returns a DataFrame (or Series) of the same shape as the input, with Boolean values:
#### True: The value is null (NaN or None).
#### False: The value is not null.

In [19]:
df.isnull().sum()

name             0
mpg              0
cylinders        3
displacement     3
horsepower      12
weight           2
acceleration     3
model_year       0
origin           0
dtype: int64

#### Step 4: Handle Missing Data
#### Option 1: If the dataset is large and only a small percentage of data is missing, you can remove rows with missing values using dropna(subset,inplace)


In [20]:
df.dropna(subset=["horsepower"], inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 386 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   name          386 non-null    object 
 1   mpg           386 non-null    float64
 2   cylinders     383 non-null    float64
 3   displacement  383 non-null    float64
 4   horsepower    386 non-null    float64
 5   weight        384 non-null    float64
 6   acceleration  384 non-null    float64
 7   model_year    386 non-null    int64  
 8   origin        386 non-null    object 
dtypes: float64(6), int64(1), object(2)
memory usage: 30.2+ KB


#### Option 2:If removing data isn't ideal, you can impute (df.[""].fillna(df[""].mean(),inplace)) missing values using methods like mean, median, or most frequent.

In [21]:
df["cylinders"].fillna(df["displacement"].mean(), inplace=True)
df["weight"].fillna(df["acceleration"].mean(), inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 386 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   name          386 non-null    object 
 1   mpg           386 non-null    float64
 2   cylinders     386 non-null    float64
 3   displacement  383 non-null    float64
 4   horsepower    386 non-null    float64
 5   weight        386 non-null    float64
 6   acceleration  384 non-null    float64
 7   model_year    386 non-null    int64  
 8   origin        386 non-null    object 
dtypes: float64(6), int64(1), object(2)
memory usage: 30.2+ KB


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df["cylinders"].fillna(df["displacement"].mean(), inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df["weight"].fillna(df["acceleration"].mean(), inplace=True)


#### Step 5: Feature Scaling


<img src="https://i.postimg.cc/G21gMYnF/f.png" alt="Image Description" width="500">









 Option 1: This method scales the data to have a mean of 0 and a standard deviation of 1.
### StandardScaler()

In [22]:
c=["cylinders","displacement","horsepower","weight"]
sc1=StandardScaler()
df[c]=sc1.fit_transform(df[c])
df.head()


Unnamed: 0,name,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin
0,chevrolet chevelle malibu,18.0,0.065096,1.087102,0.675237,0.632412,12.0,70,usa
1,buick skylark 320,15.0,0.065096,1.499934,1.595396,0.850166,11.5,70,usa
2,plymouth satellite,18.0,0.065096,1.19271,1.201042,0.554067,11.0,70,usa
3,amc rebel sst,16.0,0.065096,1.0583,1.201042,0.55061,12.0,70,usa
4,ford torino,17.0,11.240292,1.039098,0.93814,0.569045,10.5,70,usa


#### Option 2:This method scales the data to a fixed range, usually between 0 and 1. 
###  MinMaxScaler()

In [25]:
d=["cylinders","displacement","horsepower","weight"]
sc2=MinMaxScaler()
df[d]=sc2.fit_transform(df[d])
df.head()

Unnamed: 0,name,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin
0,chevrolet chevelle malibu,18.0,0.02621,0.617571,0.456522,0.680747,12.0,70,usa
1,buick skylark 320,15.0,0.02621,0.728682,0.646739,0.717629,11.5,70,usa
2,plymouth satellite,18.0,0.02621,0.645995,0.565217,0.667477,11.0,70,usa
3,amc rebel sst,16.0,0.02621,0.609819,0.565217,0.666892,12.0,70,usa
4,ford torino,17.0,1.0,0.604651,0.51087,0.670014,10.5,70,usa


####  Step 6:Separate the dataset into features (X) and target (y) variables. The target is usually the column you want to predict.

In [26]:
x=df[["mpg", "cylinders","displacement","weight","acceleration","model_year"]]
y=df["horsepower"]
y.head()

0    0.456522
1    0.646739
2    0.565217
3    0.565217
4    0.510870
Name: horsepower, dtype: float64


### Step 7: After preprocessing, save the cleaned and scaled dataset to a new CSV file


In [30]:
final=pd.concat([x,y],axis=1)
final.to_csv("Automobile.csv",index=False)

In [12]:
# Lab-1 Activities

#Perform data preprocesing for Automobile.csv

#i. Delete the column horsepower since it has few missing values

#ii. Impute missing with meadin

#iii. Apply min-max scaling and standardization on the Automobiles.csv and provide the reasoning which feature scaling method make more sense to this dataset.