# #01 Challenge | Delhi's Air Quality Data

We are starting the biweekly series of challenges in this Study Circle. After considering the topics you have suggested in the comments, we are kicking off with Time Series.

## Why this Data topic?

This morning, I read the Economist Espresso on [India's pollution season](https://espresso.economist.com/0ef63386fdcb3dc2c2914b319668ff81), and I thought it was a good idea to start the series of challenges with this topic.

## Getting the Data

After browsing several websites, such as India's [Central Pollution Control Board](https://cpcb.nic.in/National-Air-Quality-Index/) and the WHO, I found [this website](https://aqicn.org/data-platform/register/) about Air Quality Data, where we can download data for many places worldwide.

I chose Delhi to be the city we will analyse in this challenge.

Executing the following lines of code will produce the DataFrame we'll work with:

In [94]:
import pandas as pd

df = pd.read_csv('anand-vihar, delhi-air-quality.csv', parse_dates=['date'], index_col=0)
df

| date | pm25 | pm10 | o3 | no2 | so2 | co |
|---|---|---|---|---|---|---|
| 2022-10-01 | 176 | 436 | 17 | 38 | 8 | 10 |
| 2022-10-02 | 171 | 344 | 17 | 43 | 4 | 15 |
| ... | ... | ... | ... | ... | ... | ... |
| 2014-08-07 | | | 22 | 41 | 6 | 6 |
| 2021-07-31 | | | | 11 | 4 | 12 |


I needed to process the data to deliver a workable dataset in the following way:

In [95]:
#remove whitespace in the column names
df.columns = df.columns.str.strip()

#keep the numeric part of each reading (some cells were only whitespace) and convert it to float
series = df['pm25'].str.extract(r'(\w+)')[0].astype(float)

#rolling average over roughly one month (30 rows) to harmonize the data monthly
series_monthly = series.rolling(int((52*7)/12)).mean()

#remove rows with missing values
series_monthly = series_monthly.dropna()

#fill any remaining gaps by linear interpolation (only a safeguard, since dropna already removed the NaN values)
series_monthly = series_monthly.interpolate(method='linear')

#sort the index to later make a reasonable plot
series_monthly = series_monthly.sort_index()

#aggregate the information by month
series_monthly = series_monthly.to_period('M').groupby(level='date').mean()

#convert the periods back to timestamps to avoid errors with statsmodels' functions
series_monthly = series_monthly.to_timestamp()

#set a month-start frequency to avoid errors with statsmodels' functions, interpolating months without data
series_monthly = series_monthly.asfreq("MS").interpolate()

#change the name of the pandas.Series
series_monthly.name = 'air pollution pm25'
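
To see what the cleaning and monthly aggregation steps do, here is a minimal sketch on a made-up four-row table. The values, the dates, and the names `toy`, `pm25` and `monthly` are assumptions for illustration only; they are not taken from the downloaded file. The rolling average is left out to keep the example small.

```python
import pandas as pd

# Hypothetical miniature of the raw data: a padded column name,
# padded readings, one whitespace-only cell, and no rows for February.
idx = pd.to_datetime(["2022-03-02", "2022-01-01", "2022-01-02", "2022-03-01"])
toy = pd.DataFrame({" pm25": [" 120", " 200", " ", " 100"]}, index=idx)
toy.index.name = "date"

# ' pm25' -> 'pm25'
toy.columns = toy.columns.str.strip()

# Keep the digits; the whitespace-only cell has no match and becomes NaN.
pm25 = toy["pm25"].str.extract(r"(\w+)")[0].astype(float)

monthly = (
    pm25.dropna()
    .sort_index()
    .to_period("M")              # daily timestamps -> monthly periods
    .groupby(level="date")
    .mean()                      # one value per month that has data
    .to_timestamp()              # back to month-start timestamps
    .asfreq("MS")                # inserts the missing February as NaN
    .interpolate()               # fills February linearly
)
print(monthly)
```

January keeps its single valid reading (200.0), March averages to 110.0, and February is filled with 155.0, halfway between the two.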

As we don't know the coding skills of this Study Circle's members, we'll start with simple ARIMA models. From there, we will iterate on the procedure and improve the dynamics of the group.

To take on the challenge and, maybe, receive some feedback, fork this repository to your GitHub account. Otherwise, you can download this script.

The end goal is to develop an ARIMA model and plot its predictions against the actual data, resulting in a plot like the following:

![](final_plot.jpg)

Nevertheless, you can develop this challenge in any way you find attractive. The essential point of this Study Circle is the interactivity between the members to generate value and knowledge.

From your feedback, we could later work on different use cases. For example, we could create a geospatial map in Python with the predictions.

So, let's get started, and good luck!

You start with the `series_monthly` object shown in the "Start the challenge" section below.

## Learning Materials

Check out the following materials to learn how you could develop the challenge:

- [Video Tutorial](https://www.youtube.com/watch?v=gqryqIlvEoM): How to develop ARIMA models to predict stock prices

## Start the challenge

In [96]:
series_monthly

date
2014-01-01    286.023457
2014-02-01    281.428205
                 ...    
2022-08-01    115.487097
2022-09-01    143.713333
Freq: MS, Name: air pollution pm25, Length: 105, dtype: float64