# Linear Regression Challenge 3 - Air Pollution

It is winter in **Delhi**, and Cody decided to walk to the newsstand. When he got there, he was surprised to see Delhi's **Air Quality Index (AQI)** in the newspaper.  

He then collected **air samples** from different locations and, in his lab, extracted **five numeric features** from the samples. He combined these with the AQI values given in the newspapers.  

Your task: **Design a Machine Learning model** which, given the features extracted by Cody, can **predict AQI**.  


#### 📂 Dataset  

You are provided with **three CSV files**:  

- `Train.csv`  
- `Test.csv`  
- `Sample_Submission.csv`  


#### 📝 File Details  

##### 🔹 Train.csv  
- Contains **5 feature columns** (all numeric).  
- Contains **1 target column** (numeric AQI).  

##### 🔹 Test.csv  
- Contains the same **5 feature columns**.  
- ❌ Does **not** include the target column.  
- You must **predict target values** for this file.  

##### 🔹 Sample_Submission.csv  
- Shows the required **submission format**.  
- Must include **2 columns**:  
  - `id` → index of the test row  
  - `target` → predicted AQI  


#### Submission Checklist  

Before submitting, please ensure:  

- [ ] Column names **exactly match** those in `Sample_Submission.csv`  
- [ ] Data types of columns **match** sample submission  
- [ ] Number of rows = number of test cases  
- [ ] Number of columns = exactly 2 (`id`, `target`)  
- [ ] No extra spaces, no missing values  
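
The checklist above can be turned into a small automated check. The sketch below is one possible way to do it, not an official validator: `check_submission` is a hypothetical helper, and the toy frames stand in for `Sample_Submission.csv` and a candidate submission. The dtype comparison is strict and may need relaxing if the sample file stores `target` as integers.

```python
import pandas as pd


def check_submission(submission: pd.DataFrame, sample: pd.DataFrame, n_test_rows: int) -> list[str]:
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    # Column names must match the sample exactly, including order and case.
    if list(submission.columns) != list(sample.columns):
        problems.append(f"columns {list(submission.columns)} != {list(sample.columns)}")
    else:
        # Compare dtypes only when the columns line up.
        for col in sample.columns:
            if submission[col].dtype != sample[col].dtype:
                problems.append(f"dtype of '{col}' is {submission[col].dtype}, expected {sample[col].dtype}")
    if len(submission) != n_test_rows:
        problems.append(f"{len(submission)} rows, expected {n_test_rows}")
    if submission.isna().any().any():
        problems.append("submission contains missing values")
    return problems


# Toy frames (made-up values) standing in for the real CSV files.
sample = pd.DataFrame({"id": [0, 1, 2], "target": [0.0, 0.0, 0.0]})
good = pd.DataFrame({"id": [0, 1, 2], "target": [1.5, -2.0, 0.25]})
bad = pd.DataFrame({"ID": [0, 1, 2], "target": [1.5, -2.0, 0.25]})

print(check_submission(good, sample, n_test_rows=3))
print(check_submission(bad, sample, n_test_rows=3))
```

With the real files, `sample` would come from `pd.read_csv` on `Sample_Submission.csv` and `n_test_rows` from the length of the test set.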



# Data Preprocessing

In [33]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd


In [34]:
#Load the dataset (raw string, so the backslashes in the Windows path are not read as escape sequences)
data = pd.read_csv(r'.\Air quality dataset\Train.csv')

In [35]:
y = data.target
X = data.drop(columns=['target'])
X
 

| | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 |
|---|---|---|---|---|---|
| 0 | 0.293416 | -0.945599 | -0.421105 | 0.406816 | 0.525662 |
| 1 | -0.836084 | -0.189228 | -0.776403 | -1.053831 | 0.597997 |
| 2 | 0.236425 | 0.132836 | -0.147723 | 0.699854 | -0.187364 |
| 3 | 0.175312 | 0.143194 | -0.581111 | -0.122107 | -1.292168 |
| 4 | -1.693011 | 0.542712 | -2.798729 | -0.686723 | 1.244077 |
| ... | ... | ... | ... | ... | ... |
| 1595 | -0.274961 | -0.820634 | -0.757173 | -0.147555 | -0.307149 |
| 1596 | -0.076099 | 0.255257 | 0.290054 | 1.796036 | 0.340350 |
| 1597 | 1.044177 | -0.899206 | 1.730399 | -1.871057 | 0.442520 |
| 1598 | -1.269173 | -0.005052 | 1.857669 | -1.080365 | 0.736334 |


In [36]:
X.shape, y.shape

((1600, 5), (1600,))

# Using Scikit-Learn

In [37]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

In [38]:
#Standardize the features (zero mean, unit variance per column)
scaler = StandardScaler()
X = scaler.fit_transform(X)
X 

array([[ 0.29016495, -0.89871183, -0.37238147,  0.44177059,  0.52502448],
       [-0.84270473, -0.15822922, -0.72365639, -0.99464217,  0.59896038],
       [ 0.23300381,  0.15706968, -0.10209444,  0.72994655, -0.20378187],
       ...,
       [ 1.0431652 , -0.8532941 ,  1.75476416, -1.79830858,  0.44004223],
       [-1.27708547,  0.02207793,  1.88059294, -1.0207355 ,  0.74035908],
       [-1.89374689, -0.80456069, -1.39187219,  0.52221049,  1.47960738]],
      shape=(1600, 5))
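
To make the scaling step concrete, here is a minimal NumPy sketch of the arithmetic `StandardScaler` performs: subtract each column's mean and divide by its population standard deviation (`ddof=0`). The arrays are synthetic stand-ins, not the AQI data, and the names `X_train` and `X_new` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for a training and a new feature matrix (not Cody's data).
X_train = rng.normal(loc=2.0, scale=3.0, size=(100, 5))
X_new = rng.normal(loc=2.0, scale=3.0, size=(20, 5))

# "fit": per-column mean and population standard deviation, from the training data only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # ddof=0 by default

# "transform": apply the same training statistics to any data set.
X_train_scaled = (X_train - mean) / std
X_new_scaled = (X_new - mean) / std

print(X_train_scaled.mean(axis=0).round(6))  # close to 0 in every column
print(X_train_scaled.std(axis=0).round(6))   # close to 1 in every column
```

Reusing the training statistics on new data is why the notebook calls `fit_transform` on the training set but only `transform` on the test set.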

In [39]:
#Implementing Linear Regression 
model = LinearRegression()
model.fit(X, y)

LinearRegression()


In [40]:
print("Q1 to Q5 :", model.coef_ ,"\n Q0 : ", model.intercept_)

Q1 to Q5 : [29.59359198 94.65067706  8.37544469 45.52303635  2.46461552] 
 Q0 :  0.31883538441581594
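
The values printed as `Q1 to Q5` and `Q0` are the fitted coefficients and intercept of an ordinary least squares model. As a rough sketch of where such numbers come from, the same kind of fit can be reproduced with `np.linalg.lstsq` on a design matrix that has a leading column of ones. The data and the "true" parameters below are synthetic and hypothetical, chosen only so the recovered values can be checked.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic data with known parameters (hypothetical values, unrelated to the AQI data).
true_coef = np.array([3.0, -2.0, 0.5])
true_intercept = 1.5
X = rng.normal(size=(200, 3))
y = X @ true_coef + true_intercept + rng.normal(scale=0.01, size=200)

# Prepend a column of ones so the intercept is estimated together with the coefficients.
X_design = np.column_stack([np.ones(len(X)), X])

# Solve min ||X_design @ theta - y||^2; theta[0] is the intercept, theta[1:] the coefficients.
theta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
intercept, coef = theta[0], theta[1:]
print("coef:", coef, "intercept:", intercept)
```

The recovered `coef` and `intercept` should land very close to `true_coef` and `true_intercept`, since the added noise is small.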


In [41]:
#R² score on the training data (not a held-out test score)
print(model.score(X,y))

0.9660939669975617
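
For a regressor, `model.score` returns the coefficient of determination R². Because the score above is computed on the same rows the model was fitted on, it can be optimistic. The sketch below, on synthetic data, shows the R² formula written out and one simple way to estimate it on held-out rows; the 80/20 split and the helper name `r2_score` are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np


def r2_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


rng = np.random.default_rng(1)
# Synthetic regression data (hypothetical, for illustration only).
X = rng.normal(size=(500, 5))
y = X @ np.array([4.0, -3.0, 2.0, 1.0, 0.5]) + rng.normal(scale=1.0, size=500)

# Hold out the last 20% of rows; fit only on the first 80%.
split = int(0.8 * len(X))
X_fit, X_val, y_fit, y_val = X[:split], X[split:], y[:split], y[split:]

# Least squares fit with an explicit intercept column.
A_fit = np.column_stack([np.ones(len(X_fit)), X_fit])
theta, *_ = np.linalg.lstsq(A_fit, y_fit, rcond=None)

A_val = np.column_stack([np.ones(len(X_val)), X_val])
print("train R^2:", r2_score(y_fit, A_fit @ theta))
print("validation R^2:", r2_score(y_val, A_val @ theta))
```

A validation score well below the training score would suggest overfitting; for a five-feature linear model the two are usually close.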


In [42]:
# Load the test data to predict on (raw string keeps the backslashes in the path literal)
X_test = pd.read_csv(r'.\Air quality dataset\Test.csv')
X_test


| | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 |
|---|---|---|---|---|---|
| 0 | 1.015254 | 2.076209 | -0.266435 | -2.418088 | -0.980503 |
| 1 | -0.375021 | 0.953185 | 0.626719 | 0.704406 | -0.355489 |
| 2 | -1.024452 | 0.962991 | -0.407942 | -1.861274 | 0.455201 |
| 3 | -2.489841 | 0.544802 | 0.601219 | -0.607021 | -1.314286 |
| 4 | -0.384675 | -0.833624 | 1.358552 | -0.547932 | 0.411925 |
| ... | ... | ... | ... | ... | ... |
| 395 | -0.436959 | -0.575844 | -1.620908 | -0.222588 | 1.086013 |
| 396 | -0.421324 | -2.417543 | 0.876275 | 0.844565 | 0.171646 |
| 397 | 0.554728 | 1.768243 | -0.897787 | -1.193661 | 0.340563 |
| 398 | -1.627172 | 0.856471 | -0.000566 | 0.629387 | 0.453382 |


In [43]:
#Standardize the test data with the scaler fitted on the training data
X_test = scaler.transform(X_test)

y_ = model.predict(X_test)
y_.shape

(400,)

In [44]:
#make y_ a dataframe with id and target as columns
y_ = pd.DataFrame(y_, columns=['target'])

# add the id column at the front; the name must be lowercase 'id' to match Sample_Submission.csv
y_.insert(0, 'id', range(len(y_)))
y_


| | id | target |
|---|---|---|
| 0 | 0 | 114.583689 |
| 1 | 1 | 118.012815 |
| 2 | 2 | -21.739852 |
| 3 | 3 | -43.936899 |
| 4 | 4 | -95.914898 |
| ... | ... | ... |
| 395 | 395 | -81.989000 |
| 396 | 396 | -186.032535 |
| 397 | 397 | 125.292336 |
| 398 | 398 | 65.369841 |


In [None]:
#save y_ dataframe to csv file
#y_.to_csv('submission.csv', index=False)

# Using Brute Force (see the housing prices code)
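
The housing prices code is not included here. Assuming "brute force" refers to fitting the model from scratch rather than through scikit-learn, one common possibility is batch gradient descent on the mean squared error. The sketch below is that assumption made concrete, not the author's implementation; `fit_gradient_descent`, the learning rate, the iteration count, and the synthetic data are all hypothetical choices.

```python
import numpy as np


def fit_gradient_descent(X, y, lr=0.1, n_iters=1000):
    """Fit y ≈ X @ w + b by batch gradient descent on mean squared error."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iters):
        error = X @ w + b - y
        # Gradients of (1/n) * sum(error^2) with respect to w and b.
        grad_w = (2.0 / n_samples) * (X.T @ error)
        grad_b = 2.0 * error.mean()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


rng = np.random.default_rng(7)
# Synthetic, already standardized features with known parameters (illustration only).
X = rng.normal(size=(300, 5))
true_w = np.array([2.0, -1.0, 0.5, 3.0, 0.0])
y = X @ true_w + 4.0 + rng.normal(scale=0.05, size=300)

w, b = fit_gradient_descent(X, y)
print("w:", w.round(3), "b:", round(b, 3))
```

Gradient descent benefits from standardized features, because a single learning rate then suits every coefficient; on unscaled data the step size would likely need tuning.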