### Step-by-Step Tutorial on Preprocessing

This tutorial walks through preprocessing the 'AirQualityUCI.csv' dataset: handling missing and invalid values, feature scaling, feature engineering, and visualization.

### Step 1: Import Libraries and Load the Dataset
We start by importing essential libraries for data manipulation, preprocessing, and visualization, then load the dataset.


In [None]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import matplotlib.pyplot as plt
import seaborn as sns

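# Load the dataset (the UCI file is semicolon-separated and uses decimal commas)
data = pd.read_csv('AirQualityUCI.csv', sep=';', decimal=',')

# Drop columns and rows that are entirely empty (the raw file may contain trailing separators)
data = data.dropna(axis=1, how='all').dropna(how='all')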

# View the first few rows
print("Initial dataset:")
print(data.head())

# Get information about the dataset
print("\nDataset Info:")
print(data.info())

# Check for missing values
print("\nMissing values per column:")
print(data.isnull().sum())




**Explanation:**
- We use `pandas` for data handling, `numpy` for numerical operations, and `sklearn` for preprocessing.
- The dataset is loaded using `pd.read_csv()`.
- We inspect the first few rows with `data.head()` and the dataset’s structure with `data.info()`.
- We check for missing values using `data.isnull().sum()`.
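The loading options matter for this kind of file. The sketch below uses a hypothetical two-row sample (the `sample_csv` string is made up for illustration) to show how `sep` and `decimal` affect parsing, and how to count a sentinel value such as -200 that `isnull()` does not report.

```python
import io
import pandas as pd

# Hypothetical sample mimicking a semicolon-separated file with decimal commas
sample_csv = (
    "Date;Time;CO(GT);T\n"
    "10/03/2004;18.00.00;2,6;13,6\n"
    "10/03/2004;19.00.00;-200;13,3\n"
)

# With only sep=';', values such as "2,6" stay as text
semicolon_only = pd.read_csv(io.StringIO(sample_csv), sep=';')

# With decimal=',' as well, the numeric columns are parsed as floats
parsed = pd.read_csv(io.StringIO(sample_csv), sep=';', decimal=',')
print(parsed.dtypes)

# isnull() does not flag -200, so count the sentinel explicitly
sentinel_counts = (parsed.select_dtypes('number') == -200).sum()
print(sentinel_counts)
```

Running the same count on the full dataset gives a quick picture of which columns are most affected before deciding how to handle them.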



### Step 2: Handling Missing and Invalid Values
Here, we replace missing values using the mean strategy and filter out invalid values (e.g., -200).


In [None]:
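# Mark the -200 sentinel in CO(GT) as missing so it is imputed and does not skew the mean
data['CO(GT)'] = data['CO(GT)'].replace(-200, np.nan)

# Replace missing values using the mean (CO(GT) column as an example)
imputer = SimpleImputer(strategy='mean')
data['CO(GT)'] = imputer.fit_transform(data[['CO(GT)']]).ravel()

# Drop rows with invalid values (-200) in any remaining column
data = data[(data != -200).all(axis=1)].copy()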

print("\nDataset after handling missing and invalid values:")
print(data.head())




**Explanation:**
- `SimpleImputer` replaces missing values in the 'CO(GT)' column with the column’s mean.
- We filter out rows where any column has the value `-200`, which this dataset uses to mark invalid readings.
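One detail worth checking is that `SimpleImputer` only fills values it recognises as missing (NaN by default), so a sentinel such as -200 passes through untouched. The following is a minimal sketch on a hypothetical toy column (the `toy` frame and its values are made up), not the dataset itself.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy column: three valid readings and one -200 sentinel
toy = pd.DataFrame({'CO(GT)': [2.0, 3.0, -200.0, 4.0]})

# Imputing directly leaves -200 in place, because it is not NaN
untouched = SimpleImputer(strategy='mean').fit_transform(toy[['CO(GT)']]).ravel()

# Marking the sentinel as NaN first lets the imputer fill it
# with the mean of the valid readings
marked = toy.replace(-200, np.nan)
imputed = SimpleImputer(strategy='mean').fit_transform(marked[['CO(GT)']]).ravel()

print(untouched)
print(imputed)
```

Whether to impute or drop such rows is a modelling choice; the sketch only shows that the order of the two operations changes the result.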



### Step 3: Feature Scaling
Scale numerical features using `StandardScaler` (standardization) and `MinMaxScaler` (rescaling to the range 0 to 1).


In [None]:
# Standardize CO(GT), PT08.S1(CO), and C6H6(GT) using StandardScaler
std_cols = ['CO(GT)', 'PT08.S1(CO)', 'C6H6(GT)']
scaler = StandardScaler()
data[std_cols] = scaler.fit_transform(data[std_cols])

# Apply Min-Max scaling for T, RH, and AH columns
minmax_scaler = MinMaxScaler()
data[['T', 'RH', 'AH']] = minmax_scaler.fit_transform(data[['T', 'RH', 'AH']])

print("\nDataset after scaling:")
print(data.head())




**Explanation:**
- We standardize certain columns using `StandardScaler` so that each has mean 0 and variance 1, which suits algorithms that are sensitive to the scale of input features.
- We apply `MinMaxScaler` to rescale the remaining columns to the range 0 to 1.
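A quick way to confirm what each scaler does is to check the summary statistics of its output. The sketch below does this on a hypothetical toy frame; the values are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical toy data standing in for two dataset columns
toy = pd.DataFrame({
    'CO(GT)': [1.0, 2.0, 3.0, 4.0],
    'T': [10.0, 15.0, 20.0, 30.0],
})

# StandardScaler: subtract the column mean, divide by the standard deviation
standardized = StandardScaler().fit_transform(toy[['CO(GT)']])

# MinMaxScaler: map the column minimum to 0 and the maximum to 1
rescaled = MinMaxScaler().fit_transform(toy[['T']])

print("standardized mean/std:", standardized.mean(), standardized.std())
print("rescaled min/max:", rescaled.min(), rescaled.max())
```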

### Step 4: Feature Engineering
We derive time-based features (hour and day of month) from the existing 'Date' and 'Time' columns.



In [None]:

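# Combine 'Date' and 'Time' into a single datetime column and extract new features
# (assumes dates like 10/03/2004 and times like 18.00.00, as in the UCI file)
data['DateTime'] = pd.to_datetime(data['Date'] + ' ' + data['Time'], format='%d/%m/%Y %H.%M.%S')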
data['Hour'] = data['DateTime'].dt.hour
data['Day'] = data['DateTime'].dt.day

print("\nDataset after feature engineering:")
print(data[['Date', 'Time', 'DateTime', 'Hour', 'Day']].head())




**Explanation:**
- We convert the 'Date' and 'Time' columns into a single `datetime` column using `pd.to_datetime()`.
- We then extract features like the hour and day from this combined `datetime` column.
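Date parsing is sensitive to the layout of the input strings. The sketch below assumes day-first dates and dot-separated times (the two sample rows are hypothetical) and passes an explicit `format` so the day and month cannot be swapped silently.

```python
import pandas as pd

# Hypothetical rows in day/month/year and hour.minute.second layout
toy = pd.DataFrame({
    'Date': ['10/03/2004', '11/03/2004'],
    'Time': ['18.00.00', '07.00.00'],
})

# An explicit format avoids ambiguous day/month guessing
toy['DateTime'] = pd.to_datetime(
    toy['Date'] + ' ' + toy['Time'],
    format='%d/%m/%Y %H.%M.%S',
)

# Extract calendar features through the .dt accessor
toy['Hour'] = toy['DateTime'].dt.hour
toy['Day'] = toy['DateTime'].dt.day
toy['Month'] = toy['DateTime'].dt.month

print(toy[['DateTime', 'Hour', 'Day', 'Month']])
```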

### Step 5: Data Visualization
Visualize the distribution of 'CO(GT)' levels and the correlation between features.



In [None]:

# Visualize CO levels distribution
plt.figure(figsize=(10, 6))
sns.histplot(data['CO(GT)'], bins=30)
plt.title('CO Levels Distribution')
plt.show()

# Correlation heatmap to explore relationships between numeric variables
# (numeric_only=True skips text columns such as 'Date' and 'Time')
plt.figure(figsize=(12, 8))
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()




**Explanation:**
- A histogram (`sns.histplot`) shows the distribution of 'CO(GT)' levels.
- A correlation heatmap (`sns.heatmap`) displays relationships between features, highlighting which variables are positively or negatively correlated.
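The numbers behind a heatmap can also be read directly from the correlation matrix. As a rough sketch on hypothetical toy data, the snippet below computes correlations over numeric columns only and sorts the features by their correlation with 'CO(GT)'.

```python
import pandas as pd

# Hypothetical toy data; 'Date' is text and is excluded from the correlation
toy = pd.DataFrame({
    'Date': ['10/03/2004'] * 4,
    'CO(GT)': [1.0, 2.0, 3.0, 4.0],
    'C6H6(GT)': [2.0, 4.0, 6.0, 8.0],   # rises with CO(GT)
    'T': [4.0, 3.0, 2.0, 1.0],          # falls as CO(GT) rises
})

# Restrict the correlation to numeric columns
corr = toy.corr(numeric_only=True)

# Rank the other features by their correlation with CO(GT)
co_corr = corr['CO(GT)'].drop('CO(GT)').sort_values()
print(co_corr)
```

Values near 1 or -1 indicate a strong linear relationship, while values near 0 indicate little linear relationship.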

### Step 6: Saving and Displaying the Preprocessed Data
Finally, save the cleaned dataset and display the final version.



In [None]:
# Save the preprocessed data to a new CSV file
data.to_csv('AirQualityUCI_preprocessed.csv', index=False)

# Display the final dataset
print("\nFinal preprocessed dataset:")
print(data.head())




**Explanation:**
- We save the preprocessed dataset as `AirQualityUCI_preprocessed.csv`.
- We display the first few rows to confirm the preprocessing steps were applied correctly.
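To confirm that a saved file can be read back unchanged, a round-trip check is a simple safeguard. The sketch below writes a hypothetical toy frame to an in-memory buffer instead of a file on disk and compares the result.

```python
import io
import pandas as pd

# Hypothetical toy frame standing in for the preprocessed data
toy = pd.DataFrame({'CO(GT)': [0.5, -1.25], 'Hour': [18, 19]})

# Write to an in-memory buffer, then read it back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Raises an AssertionError if the two frames differ
pd.testing.assert_frame_equal(toy, reloaded)
print("Round trip OK:", reloaded.shape)
```

Note that datetime columns are written as text, so they need `parse_dates` in `pd.read_csv()` to be restored as datetimes when the file is reloaded.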

### Consolidated Code

Here's the complete executable code with all the above steps consolidated:



In [None]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import matplotlib.pyplot as plt
import seaborn as sns

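# Load the dataset (the UCI file is semicolon-separated and uses decimal commas)
data = pd.read_csv('AirQualityUCI.csv', sep=';', decimal=',')

# Drop columns and rows that are entirely empty (the raw file may contain trailing separators)
data = data.dropna(axis=1, how='all').dropna(how='all')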

# Initial dataset exploration
print("Initial dataset:")
print(data.head())
print("\nDataset Info:")
print(data.info())
print("\nMissing values per column:")
print(data.isnull().sum())

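# Handle missing and invalid values
# Mark the -200 sentinel in CO(GT) as missing so it is imputed and does not skew the mean
data['CO(GT)'] = data['CO(GT)'].replace(-200, np.nan)
imputer = SimpleImputer(strategy='mean')
data['CO(GT)'] = imputer.fit_transform(data[['CO(GT)']]).ravel()
# Drop rows with invalid values (-200) in any remaining column
data = data[(data != -200).all(axis=1)].copy()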

print("\nDataset after handling missing and invalid values:")
print(data.head())

# Feature scaling
scaler = StandardScaler()
data[['CO(GT)', 'PT08.S1(CO)', 'C6H6(GT)']] = scaler.fit_transform(data[['CO(GT)', 'PT08.S1(CO)', 'C6H6(GT)']])
minmax_scaler = MinMaxScaler()
data[['T', 'RH', 'AH']] = minmax_scaler.fit_transform(data[['T', 'RH', 'AH']])

print("\nDataset after scaling:")
print(data.head())

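# Feature engineering
# (assumes dates like 10/03/2004 and times like 18.00.00, as in the UCI file)
data['DateTime'] = pd.to_datetime(data['Date'] + ' ' + data['Time'], format='%d/%m/%Y %H.%M.%S')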
data['Hour'] = data['DateTime'].dt.hour
data['Day'] = data['DateTime'].dt.day

print("\nDataset after feature engineering:")
print(data[['Date', 'Time', 'DateTime', 'Hour', 'Day']].head())

# Data visualization
plt.figure(figsize=(10, 6))
sns.histplot(data['CO(GT)'], bins=30)
plt.title('CO Levels Distribution')
plt.show()

# numeric_only=True skips text columns such as 'Date' and 'Time'
plt.figure(figsize=(12, 8))
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

# Save the preprocessed data and display final dataset
data.to_csv('AirQualityUCI_preprocessed.csv', index=False)
print("\nFinal preprocessed dataset:")
print(data.head())
