### Problem: Cost-Effective Detection of Nitrogen Dioxide (NO2) for Air Quality Management

#### Problem Statement:
Detecting nitrogen dioxide (NO2) levels in the atmosphere is essential for effective air quality management. However, traditional detection methods often involve expensive equipment and complex procedures, limiting widespread monitoring efforts, particularly in resource-constrained environments.

#### Impact of Solving the Problem:

1. **Improved Public Health**:
   - Accurate and affordable NO2 detection enables timely interventions to reduce exposure levels, mitigating respiratory issues and chronic diseases associated with NO2 pollution.
   - Early detection and response to elevated NO2 levels can prevent adverse health outcomes and reduce healthcare costs.

2. **Environmental Protection**:
   - Cost-effective NO2 monitoring facilitates the identification of pollution hotspots and sources, enabling targeted interventions to reduce emissions and mitigate environmental damage.
   - NO2 is a precursor of secondary pollutants such as ground-level ozone and fine particulate matter, so effective NO2 detection supports efforts to limit their formation. Because NO2 is emitted largely by fossil-fuel combustion, measures that reduce it often also lower greenhouse gas emissions.

3. **Policy and Regulatory Impact**:
   - Accessible NO2 monitoring data empowers policymakers and regulatory agencies to develop evidence-based air quality management strategies and set emission standards that protect public health and the environment.
   - Cost-effective detection methods promote broader compliance with air quality regulations by lowering the barriers to monitoring and enforcement for businesses and industries.

4. **Social and Economic Benefits**:
   - Enhanced air quality resulting from improved NO2 detection contributes to a healthier and more productive population, reducing absenteeism and healthcare expenditures.
   - Investments in cost-effective NO2 monitoring technologies stimulate innovation and economic growth in the environmental monitoring sector, creating job opportunities and fostering technological advancements.

By addressing the challenge of cost-effective NO2 detection, we can achieve significant positive impacts on public health, environmental sustainability, regulatory effectiveness, and socioeconomic development. This underscores the importance of advancing accessible and reliable monitoring solutions to support effective air quality management worldwide.

**Dataset** : [Air Quality](https://archive.ics.uci.edu/dataset/360/air+quality)

**LinkedIn** : [Mubashir Iqbal](https://www.linkedin.com/in/-mubashir-iqbal/)

In [4]:
import pandas as pd
import numpy as np

In [5]:
df = pd.read_excel("AirQualityUCI.xlsx")
df.head()

Unnamed: 0,Date,Time,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH
0,2004-03-10,18:00:00,2.6,1360.0,150,11.881723,1045.5,166.0,1056.25,113.0,1692.0,1267.5,13.6,48.875001,0.757754
1,2004-03-10,19:00:00,2.0,1292.25,112,9.397165,954.75,103.0,1173.75,92.0,1558.75,972.25,13.3,47.7,0.725487
2,2004-03-10,20:00:00,2.2,1402.0,88,8.997817,939.25,131.0,1140.0,114.0,1554.5,1074.0,11.9,53.975,0.750239
3,2004-03-10,21:00:00,2.2,1375.5,80,9.228796,948.25,172.0,1092.0,122.0,1583.75,1203.25,11.0,60.0,0.786713
4,2004-03-10,22:00:00,1.6,1272.25,51,6.518224,835.5,131.0,1205.0,116.0,1490.0,1110.0,11.15,59.575001,0.788794


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9357 entries, 0 to 9356
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Date           9357 non-null   datetime64[ns]
 1   Time           9357 non-null   object        
 2   CO(GT)         9357 non-null   float64       
 3   PT08.S1(CO)    9357 non-null   float64       
 4   NMHC(GT)       9357 non-null   int64         
 5   C6H6(GT)       9357 non-null   float64       
 6   PT08.S2(NMHC)  9357 non-null   float64       
 7   NOx(GT)        9357 non-null   float64       
 8   PT08.S3(NOx)   9357 non-null   float64       
 9   NO2(GT)        9357 non-null   float64       
 10  PT08.S4(NO2)   9357 non-null   float64       
 11  PT08.S5(O3)    9357 non-null   float64       
 12  T              9357 non-null   float64       
 13  RH             9357 non-null   float64       
 14  AH             9357 non-null   float64       
dtypes: datetime64[ns](1), float64(12), int64(1), object(1)

In [8]:
df.isnull().sum()

Date             0
Time             0
CO(GT)           0
PT08.S1(CO)      0
NMHC(GT)         0
C6H6(GT)         0
PT08.S2(NMHC)    0
NOx(GT)          0
PT08.S3(NOx)     0
NO2(GT)          0
PT08.S4(NO2)     0
PT08.S5(O3)      0
T                0
RH               0
AH               0
dtype: int64

In [10]:
df.describe()

Unnamed: 0,Date,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH
count,9357,9357.0,9357.0,9357.0,9357.0,9357.0,9357.0,9357.0,9357.0,9357.0,9357.0,9357.0,9357.0,9357.0
mean,2004-09-21 04:30:05.193972480,-34.207524,1048.869652,-159.090093,1.865576,894.475963,168.6042,794.872333,58.135898,1391.363266,974.951534,9.7766,39.483611,-6.837604
min,2004-03-10 00:00:00,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0
25%,2004-06-16 00:00:00,0.6,921.0,-200.0,4.004958,711.0,50.0,637.0,53.0,1184.75,699.75,10.95,34.05,0.692275
50%,2004-09-21 00:00:00,1.5,1052.5,-200.0,7.886653,894.5,141.0,794.25,96.0,1445.5,942.0,17.2,48.55,0.976823
75%,2004-12-28 00:00:00,2.6,1221.25,-200.0,13.636091,1104.75,284.2,960.25,133.0,1662.0,1255.25,24.075,61.875,1.296223
max,2005-04-04 00:00:00,11.9,2039.75,1189.0,63.741476,2214.0,1479.0,2682.75,339.7,2775.0,2522.75,44.6,88.725,2.231036
std,,77.65717,329.817015,139.789093,41.380154,342.315902,257.424561,321.977031,126.931428,467.192382,456.922728,43.203438,51.215645,38.97667
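
The minimum of -200 in every column is not a real measurement: the UCI description of this dataset states that missing values are tagged with -200. This is why `df.isnull().sum()` reports no missing values while the means of `CO(GT)`, `NMHC(GT)` and `AH` are negative. A minimal sketch of converting the sentinel to `NaN` follows; `demo` is a small made-up frame used only for illustration, not the real data.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame (made-up rows) that uses the same -200 sentinel as the dataset.
demo = pd.DataFrame({
    "CO(GT)": [2.6, -200.0, 2.2],
    "NO2(GT)": [113.0, 92.0, -200.0],
})

# Treat the sentinel as missing so it no longer distorts summary statistics.
cleaned = demo.replace(-200, np.nan)

# Count the missing entries per column and recompute the means without them.
missing_per_column = cleaned.isna().sum()
print(missing_per_column)
print(cleaned.mean())
```

The same `replace` call could be applied to `df` before modelling; how the resulting gaps are then handled (dropping rows, imputing) is a separate choice that this sketch does not make.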


In [11]:
df.describe(include='object')

Unnamed: 0,Time
count,9357
unique,24
top,18:00:00
freq,390


In [14]:
df.drop(columns='Time',inplace=True)
df.sample()

Unnamed: 0,Date,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH
3417,2004-07-31,-200.0,929.5,-200,5.161405,771.5,-200.0,854.0,-200.0,1613.25,728.25,23.7,56.524999,1.634142


In [30]:
df['Day']= df['Date'].dt.day
df['Month'] = df['Date'].dt.month
df['Year'] = df['Date'].dt.year
df[['Date','Day','Month','Year']]

Unnamed: 0,Date,Day,Month,Year
0,2004-03-10,10,3,2004
1,2004-03-10,10,3,2004
2,2004-03-10,10,3,2004
3,2004-03-10,10,3,2004
4,2004-03-10,10,3,2004
...,...,...,...,...
9352,2005-04-04,4,4,2005
9353,2005-04-04,4,4,2005
9354,2005-04-04,4,4,2005
9355,2005-04-04,4,4,2005


In [37]:
df1 = df.drop(columns = "Date")
df1 = df1[['Day', 'Month', 'Year','CO(GT)', 'PT08.S1(CO)', 'NMHC(GT)', 'C6H6(GT)', 'PT08.S2(NMHC)',
       'NOx(GT)', 'PT08.S3(NOx)', 'PT08.S4(NO2)', 'PT08.S5(O3)',
       'T', 'RH', 'AH','NO2(GT)']]
df1.head()

Unnamed: 0,Day,Month,Year,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),PT08.S4(NO2),PT08.S5(O3),T,RH,AH,NO2(GT)
0,10,3,2004,2.6,1360.0,150,11.881723,1045.5,166.0,1056.25,1692.0,1267.5,13.6,48.875001,0.757754,113.0
1,10,3,2004,2.0,1292.25,112,9.397165,954.75,103.0,1173.75,1558.75,972.25,13.3,47.7,0.725487,92.0
2,10,3,2004,2.2,1402.0,88,8.997817,939.25,131.0,1140.0,1554.5,1074.0,11.9,53.975,0.750239,114.0
3,10,3,2004,2.2,1375.5,80,9.228796,948.25,172.0,1092.0,1583.75,1203.25,11.0,60.0,0.786713,122.0
4,10,3,2004,1.6,1272.25,51,6.518224,835.5,131.0,1205.0,1490.0,1110.0,11.15,59.575001,0.788794,116.0


In [38]:
df1.drop(columns = ['Day', 'Month', 'Year'],inplace =True)

In [40]:
X = df1.drop(columns = "NO2(GT)")
y = df1['NO2(GT)']

In [42]:
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.15,random_state=42)


## Linear Regression

### Ridge Regularization

In [43]:
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
X_train_scaled = scale.fit_transform(X_train)
X_test_scaled = scale.transform(X_test)

In [115]:
# Dataframe to save records of different experiments
results_df = pd.DataFrame(columns=[ 'iterations','alpha', 'l1_ratio', 'RMSE', 'R2 Score'])

In [141]:
from sklearn.linear_model  import SGDRegressor
# Ridge (L2) regression: penalty defaults to 'l2'; l1_ratio only takes effect when penalty='elasticnet'
model = SGDRegressor(max_iter= 500, alpha=0.002,l1_ratio=1)

In [142]:
model.fit(X_train_scaled,y_train)

In [143]:
from sklearn.metrics import r2_score,mean_squared_error
pred = model.predict(X_test_scaled)
rmse = np.sqrt(mean_squared_error(y_test,pred))
print("RMSE = ",rmse)
r2 = r2_score(y_test,pred)
print("R2 Score = ",r2)

RMSE =  54.69269580780921
R2 Score =  0.8048129969802476
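
In scikit-learn's `SGDRegressor`, the type of regularization is selected by the `penalty` argument (`'l2'` by default, or `'l1'` / `'elasticnet'`), and `l1_ratio` is only used when `penalty='elasticnet'`. The sketch below illustrates this on synthetic data from `make_regression`; the data, the helper `fit_coef` and the chosen settings are illustrative assumptions, not part of the original experiments.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the air-quality dataset).
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)


def fit_coef(**kwargs):
    """Fit an SGDRegressor with a fixed seed and return its coefficients."""
    reg = SGDRegressor(max_iter=1000, alpha=0.001, random_state=0, **kwargs)
    return reg.fit(X_demo, y_demo).coef_


# With the default penalty ('l2'), changing l1_ratio does not change the fit.
coef_default_a = fit_coef(l1_ratio=1)
coef_default_b = fit_coef(l1_ratio=0)
same_fit = np.allclose(coef_default_a, coef_default_b)
print("identical fits with default penalty:", same_fit)

# The penalty type is chosen explicitly through `penalty`.
coef_ridge = fit_coef(penalty="l2")
coef_lasso = fit_coef(penalty="l1")
coef_elastic = fit_coef(penalty="elasticnet", l1_ratio=0.5)
```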


In [144]:
result = pd.DataFrame({
        'alpha': model.alpha,
        'l1_ratio': model.l1_ratio,
        'RMSE': rmse,
        'R2 Score': r2,
        'iterations':model.max_iter
    },index=[0])
results_df = pd.concat([results_df,result],axis=0,ignore_index=True)
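
Each experiment here is run by editing and re-executing the cells, which appends one row per run. A hedged alternative is to loop over the settings and build the table in one pass, so that executing a cell twice cannot add the same row twice. The sketch uses synthetic data, and `run_experiment`, `grid` and `demo_results` are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the air-quality dataset).
X_demo, y_demo = make_regression(n_samples=400, n_features=6, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.15, random_state=42)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)
X_te_s = scaler.transform(X_te)


def run_experiment(max_iter, alpha):
    """Fit one model and return its settings and test metrics as a dict."""
    reg = SGDRegressor(max_iter=max_iter, alpha=alpha, random_state=42)
    reg.fit(X_tr_s, y_tr)
    pred = reg.predict(X_te_s)
    return {
        "iterations": max_iter,
        "alpha": alpha,
        "RMSE": float(np.sqrt(mean_squared_error(y_te, pred))),
        "R2 Score": float(r2_score(y_te, pred)),
    }


# Illustrative (max_iter, alpha) pairs; one row is recorded per pair.
grid = [(100, 0.0001), (500, 0.0001), (500, 0.002)]
records = [run_experiment(max_iter, alpha) for max_iter, alpha in grid]
demo_results = pd.DataFrame(records).sort_values("RMSE", ignore_index=True)
print(demo_results)
```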

In [146]:
results_df['Regularization']='ridge'

In [147]:
results_df

Unnamed: 0,iterations,alpha,l1_ratio,RMSE,R2 Score,Regularization
0,100,0.0001,1,54.647796,0.805133,ridge
1,200,0.0001,1,54.520177,0.806042,ridge
2,500,0.0001,1,54.358886,0.807188,ridge
3,2000,0.0001,1,55.141826,0.801594,ridge
4,2000,0.002,1,55.0728,0.802091,ridge
5,500,0.002,1,54.692696,0.804813,ridge


### Lasso Regression

In [153]:
# Dataframe to save records of different experiments
results_lasso = pd.DataFrame(columns=[ 'iterations','alpha', 'l1_ratio', 'RMSE', 'R2 Score'])

In [172]:
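# penalty is not set, so it stays at the default 'l2' and l1_ratio=0 is ignored;
# an L1 (lasso) fit would need penalty='l1'
model2 = SGDRegressor(max_iter=5000, alpha=0.0001, l1_ratio=0)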
model2.fit(X_train_scaled,y_train)
pred = model2.predict(X_test_scaled)
rmse = np.sqrt(mean_squared_error(y_test,pred))
print("RMSE = ",rmse)
r2 = r2_score(y_test,pred)
print("R2 Score = ",r2)

RMSE =  54.922080673762395
R2 Score =  0.8031723086126391


In [173]:
result = pd.DataFrame({
        'alpha': model2.alpha,
        'l1_ratio': model2.l1_ratio,
        'RMSE': rmse,
        'R2 Score': r2,
        'iterations':model2.max_iter
    },index=[0])
results_lasso = pd.concat([results_lasso,result],axis=0,ignore_index=True)

In [174]:
results_lasso

Unnamed: 0,iterations,alpha,l1_ratio,RMSE,R2 Score
0,500,0.002,0,55.263279,0.800719
1,500,0.02,0,55.677837,0.797718
2,500,0.08,0,57.647821,0.783151
3,500,0.1,0,57.954143,0.78084
4,500,0.0001,0,54.843091,0.803738
5,1000,0.0001,0,54.477565,0.806345
6,5000,0.0001,0,54.922081,0.803172


In [175]:
results_lasso['Regularization'] = 'lasso'

### ElasticNet Regression

In [197]:
# Dataframe to save records of different experiments
results_elastic = pd.DataFrame(columns=[ 'iterations','alpha', 'l1_ratio', 'RMSE', 'R2 Score'])

In [250]:
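# penalty is not set, so it stays at the default 'l2' and l1_ratio=0.6 is ignored;
# an elastic-net fit would need penalty='elasticnet'
model3 = SGDRegressor(max_iter=500, alpha=0.009, l1_ratio=0.6)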
model3.fit(X_train_scaled,y_train)
pred = model3.predict(X_test_scaled)
rmse = np.sqrt(mean_squared_error(y_test,pred))
print("RMSE = ",rmse)
r2 = r2_score(y_test,pred)
print("R2 Score = ",r2)

RMSE =  55.1965097802469
R2 Score =  0.8012004169515249


In [251]:
result = pd.DataFrame({
        'alpha': model3.alpha,
        'l1_ratio': model3.l1_ratio,
        'RMSE': rmse,
        'R2 Score': r2,
        'iterations':model3.max_iter
    },index=[0])
results_elastic = pd.concat([results_elastic,result],axis=0,ignore_index=True)

In [252]:
results_elastic

Unnamed: 0,iterations,alpha,l1_ratio,RMSE,R2 Score
0,5000,0.001,0.5,54.746036,0.804432
1,5000,0.001,0.9,54.908884,0.803267
2,5000,0.001,0.3,55.213567,0.801078
3,5000,0.01,0.5,55.353534,0.800068
4,5000,0.001,0.4,55.170463,0.801388
5,5000,0.0055,0.7,55.206163,0.801131
6,500,0.009,0.6,55.19651,0.8012


In [253]:
results_elastic['Regularization']='elastic'

In [258]:
results_df = pd.concat([results_df,results_lasso,results_elastic],axis=0,ignore_index=True)
results_df = results_df.sort_values(['RMSE','R2 Score'],ascending=[True,False])
results_df.head()

Unnamed: 0,iterations,alpha,l1_ratio,RMSE,R2 Score,Regularization
0,500,0.0001,1,54.358886,0.807188,ridge
1,1000,0.0001,0,54.477565,0.806345,lasso
2,1000,0.0001,0,54.477565,0.806345,lasso
3,1000,0.0001,0,54.477565,0.806345,lasso
53,1000,0.0001,0,54.477565,0.806345,lasso


In [259]:
results_df = results_df.sort_values(['R2 Score','RMSE'],ascending=[False,True])
results_df.head()

Unnamed: 0,iterations,alpha,l1_ratio,RMSE,R2 Score,Regularization
0,500,0.0001,1,54.358886,0.807188,ridge
1,1000,0.0001,0,54.477565,0.806345,lasso
2,1000,0.0001,0,54.477565,0.806345,lasso
3,1000,0.0001,0,54.477565,0.806345,lasso
53,1000,0.0001,0,54.477565,0.806345,lasso


Based on these runs, the configuration labelled ridge (500 iterations, alpha = 0.0001) achieved the lowest RMSE (54.36) and the highest R² score (0.807), although its margin over the next-best runs is only about 0.1 RMSE. We therefore train the final linear model with these settings.

In [291]:
final_model =  SGDRegressor(max_iter=500,alpha = 0.0001,l1_ratio=1)
final_model.fit(X_train_scaled,y_train)
pred = final_model.predict(X_test_scaled)
rmse = np.sqrt(mean_squared_error(y_test,pred))
print("RMSE = ",rmse)
r2 = r2_score(y_test,pred)
print("R2 Score = ",r2)

RMSE =  54.27958509933857
R2 Score =  0.8077504757001064
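
The final model reuses the settings of the best logged run (500 iterations, alpha = 0.0001) but scores an RMSE of 54.28 rather than 54.36. `SGDRegressor` shuffles the training data by default and no `random_state` is set in these cells, so repeated fits differ slightly. The sketch below, on synthetic data with a hypothetical helper `rmse_for_seed`, shows one way to measure that run-to-run spread and to make a single run repeatable.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the air-quality dataset).
X_demo, y_demo = make_regression(n_samples=400, n_features=6, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.15, random_state=42)
scaler = StandardScaler()
X_tr_s, X_te_s = scaler.fit_transform(X_tr), scaler.transform(X_te)


def rmse_for_seed(seed):
    """Train with a fixed shuffling seed and return the test RMSE."""
    reg = SGDRegressor(max_iter=500, alpha=0.0001, random_state=seed)
    reg.fit(X_tr_s, y_tr)
    return float(np.sqrt(mean_squared_error(y_te, reg.predict(X_te_s))))


# Same hyperparameters, different seeds: the spread indicates how large a
# difference between two configurations must be before it is meaningful.
seed_rmses = np.array([rmse_for_seed(seed) for seed in range(5)])
print("RMSE per seed:", seed_rmses.round(3))
print("spread (max - min):", round(float(seed_rmses.max() - seed_rmses.min()), 3))

# Repeating a seed reproduces the same score.
repeat = rmse_for_seed(0)
```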


## Polynomial Regression

In [293]:
from sklearn.preprocessing import PolynomialFeatures
# random_state = 42
polynomial = pd.DataFrame( columns = ['Degree','Iterations','Regularization','Alpha','l1_ratio'])

In [368]:
add_degree = PolynomialFeatures(degree=2)
Xtrain_poly = add_degree.fit_transform(X_train)
Xtest_poly = add_degree.transform(X_test)
scale = StandardScaler()
Xtrain_poly = scale.fit_transform(Xtrain_poly)
Xtest_poly = scale.transform(Xtest_poly)
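
The expand-then-scale steps above can also be chained in a scikit-learn `Pipeline`, which keeps the transformers fitted on the training split only and avoids reassigning `scale`. This is a sketch under stated assumptions: the data are synthetic with a deliberately quadratic target, `include_bias=False` is a choice made here (the cell above keeps the default bias column), and the names `linear_pipe` and `poly_pipe` are illustrative.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data whose target depends on squared and interaction terms.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(600, 2))
y_demo = X_demo[:, 0] ** 2 + 2 * X_demo[:, 0] * X_demo[:, 1] + rng.normal(0, 0.1, 600)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.15, random_state=42)

# Baseline: scaled linear features only.
linear_pipe = make_pipeline(StandardScaler(), SGDRegressor(max_iter=500, random_state=42))

# Degree-2 expansion, then scaling, then the same regressor.
poly_pipe = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    SGDRegressor(max_iter=500, random_state=42),
)

r2_linear = r2_score(y_te, linear_pipe.fit(X_tr, y_tr).predict(X_te))
r2_poly = r2_score(y_te, poly_pipe.fit(X_tr, y_tr).predict(X_te))
print("linear R2:", round(r2_linear, 3), "degree-2 R2:", round(r2_poly, 3))
```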

In [380]:
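# penalty is left at its default ('l2'), so l1_ratio is not used by the fit;
# the 'regularization' attribute is only a label recorded in the experiment table
model = SGDRegressor(max_iter=50, alpha=0.005, l1_ratio=0.38, random_state=42)
model.regularization = 'elastic'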
model.fit(Xtrain_poly,y_train)
pred = model.predict(Xtest_poly)
rmse = np.sqrt(mean_squared_error(y_test,pred))
print("RMSE = ",rmse)
r2 = r2_score(y_test,pred)
print("R2 Score = ",r2)

experiment = pd.DataFrame( {
    'Degree':add_degree.degree,
    'Iterations' : model.max_iter,
    'Regularization' : model.regularization,
    'Alpha':model.alpha,
    'l1_ratio' : model.l1_ratio,
    'r2_score' : r2,
    'rmse' : rmse
},index = [0])
polynomial = pd.concat([polynomial,experiment],axis=0)

RMSE =  27.173236836078033
R2 Score =  0.9518190916783814


- **Several hyperparameter combinations were tried.**
- **The results and hyperparameters are shown in the table below.**

In [381]:
polynomial

Unnamed: 0,Degree,Iterations,Regularization,Alpha,l1_ratio,r2_score,rmse
0,2,500,ridge,0.0001,1.0,0.9536052,26.66482
0,3,500,ridge,0.0001,1.0,-1.545292e+17,48664150000.0
0,4,500,ridge,0.0001,1.0,-2.280715e+19,591206800000.0
0,2,500,ridge,0.001,1.0,0.953275,26.75954
0,2,500,ridge,0.005,1.0,0.9518191,27.17324
0,2,500,ridge,0.007,1.0,0.9497835,27.74133
0,2,500,lasso,0.007,0.0,0.9497835,27.74133
0,2,500,lasso,0.01,0.0,0.9489704,27.96502
0,2,500,lasso,0.0045,0.0,0.9519965,27.12316
0,2,500,lasso,0.1,0.0,0.9300855,32.73312
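
The degree 3 and degree 4 rows have strongly negative R² and very large RMSE, which means those fits did not converge to a usable model. The notebook does not diagnose the cause. One thing that certainly changes with the degree is the number of generated columns, which the sketch below counts for the 12 predictors in `X`; whether this alone explains the failed fits is an assumption, not a finding.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_inputs = 12  # predictor columns left in X after dropping NO2(GT)
placeholder = np.zeros((1, n_inputs))  # values are irrelevant; only the shape matters

feature_counts = {}
for degree in (2, 3, 4):
    n_out = PolynomialFeatures(degree=degree).fit(placeholder).n_output_features_
    # Closed form: monomials of total degree <= d in n variables, bias included.
    assert n_out == comb(n_inputs + degree, degree)
    feature_counts[degree] = n_out

print(feature_counts)  # {2: 91, 3: 455, 4: 1820}
```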


In [383]:
polynomial.sort_values(['r2_score','rmse'],ascending=[False,True]).head(2)

Unnamed: 0,Degree,Iterations,Regularization,Alpha,l1_ratio,r2_score,rmse
0,2,500,ridge,0.0001,1.0,0.953605,26.664819
0,2,500,elastic,0.0001,0.9,0.953605,26.664819


In [384]:
polynomial.sort_values(['rmse','r2_score'],ascending=[True,False]).head(2)

Unnamed: 0,Degree,Iterations,Regularization,Alpha,l1_ratio,r2_score,rmse
0,2,500,ridge,0.0001,1.0,0.953605,26.664819
0,2,500,elastic,0.0001,0.9,0.953605,26.664819


### Report Conclusion

After experimenting with linear and polynomial regression under several regularization settings, we found that the degree-2 polynomial model clearly outperformed the linear model: R² rose from about 0.81 to 0.95 and RMSE fell from about 54.3 to 26.7. The degree 3 and degree 4 runs did not produce usable models (their R² scores were strongly negative). Different regularization settings were also tested to guard against overfitting.

The best-performing model used the following hyperparameters:

| Degree | Iterations | Regularization | Alpha | L1 Ratio | R² Score | RMSE      | Random state|
|--------|------------|----------------|-------|----------|----------|-----------|-----------|
| 2      | 500        | Ridge          | 0.0001| 1        | 0.953605 | 26.664819 |42 |

These findings suggest that polynomial features combined with regularization substantially improve the model's ability to capture nonlinear patterns in the sensor data. A model of this kind can support low-cost monitoring of NO2 concentrations and help inform measures against air pollution.

---

<div align="center">
    <p style="font-family: cursive; font-size: 24px;">Thank You</p>
    <p style="font-family: cursive; font-size: 18px;">Mubashir Iqbal</p>
</div>
