# ⚠️ **Important Disclaimer**

Do **not edit or delete** any of the **Markdown cells** (the ones containing the questions and instructions).

Only write your answers in the **code cells provided below each question**.  
This ensures consistency during our feedback process.

### Q1. Load and Explore the Dataset

Load the `AirQalityDataset.csv` file into a pandas DataFrame using the correct separator.

After loading the data:

1. Display basic information about the dataset.
2. Save the statistical description of the dataset into a separate variable.
3. Drop fully empty/unnamed columns, and rows
4. Use `type()` to print the type of that description variable.

In [1]:
# your Code Here
import pandas as pd
import numpy as np
np.random.seed(0)

In [2]:
df = pd.read_csv('AirQualityDataset.csv', sep= ';')

In [3]:
df = df.dropna(axis=0, how='all')
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]

In [4]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
Index: 9357 entries, 0 to 9356
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Date           9357 non-null   object 
 1   Time           9357 non-null   object 
 2   CO(GT)         9357 non-null   float64
 3   PT08.S1(CO)    9357 non-null   float64
 4   NMHC(GT)       9357 non-null   float64
 5   C6H6(GT)       9357 non-null   float64
 6   PT08.S2(NMHC)  9357 non-null   float64
 7   NOx(GT)        9357 non-null   float64
 8   PT08.S3(NOx)   9357 non-null   float64
 9   NO2(GT)        9357 non-null   float64
 10  PT08.S4(NO2)   9357 non-null   float64
 11  PT08.S5(O3)    9357 non-null   float64
 12  T              9357 non-null   float64
 13  RH             9357 non-null   float64
 14  AH             9357 non-null   float64
dtypes: float64(13), object(2)
memory usage: 1.1+ MB
None


In [5]:
description = df.describe()
print(description)

            CO(GT)  PT08.S1(CO)     NMHC(GT)     C6H6(GT)  PT08.S2(NMHC)  \
count  9357.000000  9357.000000  9357.000000  9357.000000    9357.000000   
mean    -34.207524  1048.990061  -159.090093     1.865683     894.595276   
std      77.657170   329.832710   139.789093    41.380206     342.333252   
min    -200.000000  -200.000000  -200.000000  -200.000000    -200.000000   
25%       0.600000   921.000000  -200.000000     4.000000     711.000000   
50%       1.500000  1053.000000  -200.000000     7.900000     895.000000   
75%       2.600000  1221.000000  -200.000000    13.600000    1105.000000   
max      11.900000  2040.000000  1189.000000    63.700000    2214.000000   

           NOx(GT)  PT08.S3(NOx)      NO2(GT)  PT08.S4(NO2)  PT08.S5(O3)  \
count  9357.000000   9357.000000  9357.000000   9357.000000  9357.000000   
mean    168.616971    794.990168    58.148873   1391.479641   975.072032   
std     257.433866    321.993552   126.940455    467.210125   456.938184   
min    -200

In [6]:
print(type(description))

<class 'pandas.core.frame.DataFrame'>


### Q2. Dataset structure and features overview  
Write a code to collect:
1. The number of rows and columns in the dataset.
2. The list of first 10 feature columns excluding `'Date'` and `'Time'`.  

Store both lists in tuple called `dataset_info` and print it.

In [7]:
# your Code Here
dimensions = df.shape
Features = [ f for f in df.columns if f not in ['Date', 'Time']][:10]

In [8]:
dataset_inf0 = ((dimensions), list(Features))
print(dataset_inf0)

((9357, 15), ['CO(GT)', 'PT08.S1(CO)', 'NMHC(GT)', 'C6H6(GT)', 'PT08.S2(NMHC)', 'NOx(GT)', 'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)', 'PT08.S5(O3)'])


### Q3. CO(GT) summary with Pandas and NumPy
Compute the **mean** and **standard deviation** of `CO(GT)` using both:
- Pandas
- NumPy

In [9]:
# your Code Here
df['CO(GT)'] = pd.to_numeric(df['CO(GT)'], errors='coerce')
df['CO(GT)'] = df['CO(GT)'].replace(-200, np.nan)
df['CO(GT)'] = df['CO(GT)'] +1

In [10]:
mean_pandas = df['CO(GT)'].mean()
std_pandas = df['CO(GT)'].std()
print('Mean of CO(GT) in pandas:', mean_pandas, '\nStandard deviation of CO(GT) in pandas:', std_pandas)

Mean of CO(GT) in pandas: 3.152749543914516 
Standard deviation of CO(GT) in pandas: 1.4532520363373485


In [11]:
mean_numpy = np.mean(df['CO(GT)'])
std_numpy = np.std(df['CO(GT)'])
print('Mean of CO(GT) in numpy:', mean_numpy, '\nStandard deviation of CO(GT) in numpy:', std_numpy)

Mean of CO(GT) in numpy: 3.152749543914516 
Standard deviation of CO(GT) in numpy: 1.4531573465156917


### Q4. Absolute humidity (AH) distribution  
Compute the **min**, **median**, and **max** of `AH` using Pandas.  

Do you notice an issue in the values?  
If you think that there are values that are problematic, replace them with the median of the column and print the same three statistics after that.


In [12]:
# your code here
AH_min = df['AH'].min()
AH_median = df['AH'].median()
AH_max = df['AH'].max()
print('The minimum value of AH is:', AH_min, '\nThe median value of AH is:', AH_median, '\nThe maximum value of AH is:', AH_max)

The minimum value of AH is: -200.0 
The median value of AH is: 0.9768 
The maximum value of AH is: 2.231


In [13]:
df.loc[df['AH'] < 0, 'AH'] = AH_median

In [14]:
ah_min = df['AH'].min()
ah_median = df['AH'].median() 
ah_max = df['AH'].max()

print('The minimum value of AH is:', ah_min, '\nThe median value of AH is:', ah_median, '\nThe maximum value of AH is:', ah_max)

The minimum value of AH is: 0.1847 
The median value of AH is: 0.9768 
The maximum value of AH is: 2.231


### Q5. Humidity bands
Create a new column `humidity_band` using `RH`:
- `'dry'` if `RH < 30`
- `'comfortable'` if `30 <= RH <= 60`
- `'humid'` if `RH > 60`

Then show the **count** of each category.

In [15]:
# your Code Here
def Humidity_band(RH):
    if RH > 60:
        return 'Humid'
    elif 30 <= RH <= 60:
        return 'Comfortable'
    else:
        return 'Dry'

df['Humidity_band'] = df['RH'].apply(Humidity_band)
print(df['Humidity_band'].value_counts())

Humidity_band
Comfortable    4929
Humid          2633
Dry            1795
Name: count, dtype: int64


### Q6. Compute the Average 'CO(GT)' for Humid Conditions  

Using the `'humidity_band'` column created above, filter the dataset for rows labeled `'humid'` and compute the **average value of `'CO(GT)'`** for these observations.  

Format the output to 4 decimal places for better readability and precision.

In [16]:
# your Code Here
df['CO(GT)'] = pd.to_numeric(df['CO(GT)'], errors='coerce')
df['CO(GT)'] = df['CO(GT)'].replace(-200, np.nan)

Humidity_rows = df[df['Humidity_band'] == 'Humid']
Humidity_avg = Humidity_rows['CO(GT)'].mean()
print(f"The Average 'CO(GT)' for Humid Conditions: {Humidity_avg:.4f}")

The Average 'CO(GT)' for Humid Conditions: 3.2323


### Q7. Retrieve and sort array by a specific column
Create a NumPy array from the columns `[T, RH, AH]` (in this order), then sort the array by the **third column (`AH`)** ascending. Show the first 5 rows.

In [17]:
# your Code Here
for col in ['T', 'RH', 'AH']:
    df[col] = pd.to_numeric(df[col], errors='coerce')
    df[col] = df[col].replace(-200, np.nan)
    df[col] = df[col].fillna(df[col].median())
    

data = df[['T', 'RH', 'AH']].to_numpy()
sort_index = 0
data_sorted = data[data[:, 2].argsort()]
print(data_sorted[:5])

[[ 0.     29.7     0.1847]
 [11.8    13.5     0.1862]
 [ 0.2    30.2     0.191 ]
 [-0.1    31.9     0.1975]
 [12.2    14.      0.1988]]


### Q8. Normalized moisture index

Using the NumPy array you built above (**Do not change it**):  

1. Using Numpy, **Convert RH to a fraction** (0–1 scale) by dividing it by 100 and save it to another array `RH_frac`.
2. Using Numpy, **Compute a normalized moisture index** by dividing `AH` by `RH_frac`. This almost computes the amount of absolute humidity per unit of relative humidity.

Print the first 10 values of this new array and then **store** the result in the original DataFrame as a new column `'moisture_index'`.

In [18]:
# your Code Here
RH_frac = data[:, 1] / 100

moisture_index = data[:, 2] / RH_frac

print(moisture_index[:10])

df['moisture_index'] = moisture_index

[1.54969325 1.52096436 1.38925926 1.31116667 1.32348993 1.32567568
 1.33855634 1.28366667 1.28107203 1.2486711 ]


### Q9. Temperature profile for high moisture index  

Using Numpy only, and the `moisture_index` values you computed earlier:  

1. Find the **median** of `moisture_index`.  
2. Filter `tri_array` to include only rows where `moisture_index` is above this median.  
3. Compute and print the **mean temperature** for this high-moisture group using only NumPy.

Format the output to 4 decimal places for better readability and precision.

In [19]:
# your Code Here
tri_data = data
median = np.median(moisture_index)
mask = moisture_index > median
mean_temperature = np.mean(tri_data[mask, 0])

print('The median of moisture_index is:', median)
print(f'mean temperature:{mean_temperature:.4f}')

The median of moisture_index is: 1.9693548387096775
mean temperature:25.3409


### Q10. Percentile-based filtering
Compute:
- the **85th percentile** of `C6H6(GT)` (benzene), and
- the **25th percentile** of `RH`.

Filter and return rows where `C6H6(GT)` is **above** its 85th percentile **and** `RH` is **below** its 25th percentile. Show the number of rows and the first 5 matches.


In [20]:
# your Code Here
df['C6H6(GT)'] = pd.to_numeric(df['C6H6(GT)'], errors='coerce').replace(-200, np.nan)
df['RH'] = pd.to_numeric(df['RH'], errors='coerce').replace(-200, np.nan)


In [21]:
p85 = np.percentile(df['C6H6(GT)'], 85)
p25 = np.percentile(df['RH'], 25)

cond = (df['C6H6(GT)'] > p25) & ( df['RH'] < p25)
filtered_rows = df[cond]

print(filtered_rows.shape[0], filtered_rows.head())

4             Date      Time  CO(GT)  PT08.S1(CO)  NMHC(GT)  C6H6(GT)  \
169   17/03/2004  19.00.00     8.6       1973.0     577.0      38.4   
1839  26/05/2004  09.00.00     5.0       1303.0    -200.0      40.2   
4848  28/09/2004  18.00.00     7.7       1432.0    -200.0      36.9   
4849  28/09/2004  19.00.00     8.5       1479.0    -200.0      39.3   

      PT08.S2(NMHC)  NOx(GT)  PT08.S3(NOx)  NO2(GT)  PT08.S4(NO2)  \
169          1737.0    411.0         617.0    194.0        2414.0   
1839         1776.0   -200.0         946.0   -200.0        1467.0   
4848         1706.0    702.0         488.0    204.0        2098.0   
4849         1756.0    755.0         466.0    209.0        2165.0   

      PT08.S5(O3)     T    RH      AH Humidity_band  moisture_index  
169        2306.0  23.1  26.5  0.7403           Dry        2.793585  
1839       1608.0  20.4  35.7  0.8440   Comfortable        2.364146  
4848       1842.0  24.7  27.9  0.8559           Dry        3.067742  
4849       1908.

### Q11. Simulate Sensor Measurement Noise and Analyze the Effect  

Simulate **normally distributed measurement noise** with a mean of `0` and a standard deviation of `100` (in raw sensor units). Then:  

- Use **NumPy** to generate the noise.  
- Use **Pandas** to add this noise to the `'PT08.S1(CO)'` column and store the result in a new column `'PT08.S1_noisy'`.  
- Print the **mean** and **standard deviation** of both `'PT08.S1(CO)'` and `'PT08.S1_noisy'` to observe the impact of the simulated noise.  

Observe how the added noise affects the distribution, particularly the spread (**standard deviation**). Format all printed values to **4 decimal places** using `.4f`.  


In [22]:
# your Code Here
noise = np.random.normal(0, 100, size=df.shape[0])

df['PT08.S1_noisy'] = df['PT08.S1(CO)'] + noise

mean_CO = df['PT08.S1(CO)'].mean()
std_CO = df['PT08.S1(CO)'].std()

print(f'mean of PT08.S1(CO) is:{mean_CO:.4f}', f'\nStandard deviation of PT08.S1(CO) is:{std_CO:.4f}' )

mean of PT08.S1(CO) is:1048.9901 
Standard deviation of PT08.S1(CO) is:329.8327


In [23]:
mean_noisy = df['PT08.S1_noisy'].mean()
std_noisy = df['PT08.S1_noisy'].std()

print(f'mean of PT08.S1_noisy is:{mean_noisy:.4f}', f'\nStandard deviation of PT08.S1_noisy is:{std_noisy:.4f}' )

mean of PT08.S1_noisy is:1046.8703 
Standard deviation of PT08.S1_noisy is:345.1911


# Make Your Results Reproducible

If you re-run the previous cell multiple times, you'll notice that the results involving randomness (e.g., simulated noise) change each time. This is because NumPy generates new random numbers on every execution.

To make your results **reproducible** (meaning that both you and your instructor get the **same output every time**) you need to set a fixed **random seed**.

As the final task, go back and add the following line to your code **immediately after importing NumPy** for the first time in your notebook:

```python
np.random.seed(0)


So, your NumPy import at the top of the notebook should now look like this:

```python
import numpy as np
np.random.seed(0)


# After Making This Change:

- Re-run **all cells** in the notebook from top to bottom.  
- Make sure **all outputs are visible**.  
- **Save your notebook.**  
- **Submit it as-is (with all outputs included.)**
