# DATA ANALYSIS AND PREPROCESSING

## 1. Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns', None)
sns.set_style('whitegrid')

## 2. Load the Datasets

In [3]:
train_df = pd.read_csv('../data/train.csv')
test_df = pd.read_csv('../data/test.csv')

## 3. Data Inspection

We start with a few basic checks to get a high-level view of the data's structure, its data types, and potential issues such as missing values.

### 3.1 View the First Few Rows
We'll preview the first five rows of the training data.

In [4]:
print("Training Data Head:")
display(train_df.head())

Training Data Head:


|   | No | DEWP | TEMP | PRES | Iws | Is | Ir | datetime | cbwd_NW | cbwd_SE | cbwd_cv | pm2.5 |
|---|----|------|------|------|-----|----|----|----------|---------|---------|---------|-------|
| 0 | 1 | -1.580878 | -1.92225 | 0.443328 | -0.441894 | -0.069353 | -0.137667 | 2010-01-01 00:00:00 | 1.448138 | -0.732019 | -0.522096 | NaN |
| 1 | 2 | -1.580878 | -2.004228 | 0.345943 | -0.379306 | -0.069353 | -0.137667 | 2010-01-01 01:00:00 | 1.448138 | -0.732019 | -0.522096 | NaN |
| 2 | 3 | -1.580878 | -1.92225 | 0.248559 | -0.343514 | -0.069353 | -0.137667 | 2010-01-01 02:00:00 | 1.448138 | -0.732019 | -0.522096 | NaN |
| 3 | 4 | -1.580878 | -2.168183 | 0.248559 | -0.280926 | -0.069353 | -0.137667 | 2010-01-01 03:00:00 | 1.448138 | -0.732019 | -0.522096 | NaN |
| 4 | 5 | -1.511594 | -2.004228 | 0.151174 | -0.218339 | -0.069353 | -0.137667 | 2010-01-01 04:00:00 | 1.448138 | -0.732019 | -0.522096 | NaN |


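**Observation:** The output shows numerical features (such as `DEWP` and `TEMP`), wind-direction indicator columns produced by one-hot encoding (such as `cbwd_NW`), and the target variable, `pm2.5`. The feature values do not look like raw measurements: `cbwd_NW`, for instance, shows 1.448138 rather than 0 or 1, which suggests the features have already been standardized. The `pm2.5` column is `NaN` (Not a Number) in all five of the rows shown.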
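
The missing-value observation above can be quantified rather than read off the preview. The following is a minimal sketch on a toy DataFrame, not on the real `train_df`: the name `toy_df` and all of its values are made up for illustration, and only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_df: the values are made up for illustration only.
toy_df = pd.DataFrame({
    "TEMP": [-1.9, -2.0, -1.9, -2.2, -2.0, -1.8],
    "pm2.5": [np.nan, np.nan, np.nan, 129.0, 148.0, np.nan],
})

# Number of missing values in each column.
missing_counts = toy_df.isna().sum()

# Share of missing values in each column, as a percentage.
missing_pct = toy_df.isna().mean() * 100

# Index label of the first row where the target is present.
first_valid = toy_df["pm2.5"].first_valid_index()

print(missing_counts)
print(missing_pct.round(1))
print("First non-missing pm2.5 at row:", first_valid)
```

The same three calls should work unchanged on `train_df`, where they would show how many `pm2.5` values are missing and where the first usable target value appears.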

### 3.2 Handle Datetime Column
For time series analysis, the time column must be parsed to the `datetime` type and set as the index. This enables time-based slicing, resampling, and plotting directly on the DataFrame.


In [5]:
# Convert the 'datetime' column in both dataframes
train_df['datetime'] = pd.to_datetime(train_df['datetime'])
test_df['datetime'] = pd.to_datetime(test_df['datetime'])

# Set datetime as the index
train_df.set_index('datetime', inplace=True)
test_df.set_index('datetime', inplace=True)

print("\nData types after setting datetime index:")
train_df.info()


Data types after setting datetime index:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 30676 entries, 2010-01-01 00:00:00 to 2013-07-02 03:00:00
Data columns (total 11 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   No       30676 non-null  int64  
 1   DEWP     30676 non-null  float64
 2   TEMP     30676 non-null  float64
 3   PRES     30676 non-null  float64
 4   Iws      30676 non-null  float64
 5   Is       30676 non-null  float64
 6   Ir       30676 non-null  float64
 7   cbwd_NW  30676 non-null  float64
 8   cbwd_SE  30676 non-null  float64
 9   cbwd_cv  30676 non-null  float64
 10  pm2.5    28755 non-null  float64
dtypes: float64(10), int64(1)
memory usage: 2.8 MB


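**Observation:** This confirms the changes. The index is now a `DatetimeIndex`, and `datetime` no longer appears in the list of columns, so the data is properly formatted as a time series. The summary also quantifies the missing target values: `pm2.5` has 28,755 non-null entries out of 30,676, so 1,921 values (about 6.3%) are missing, while every feature column is complete.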
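
To illustrate what a `DatetimeIndex` makes possible, here is a small sketch of time-based selection and resampling. It runs on a hypothetical hourly frame named `toy` with made-up values, not on `train_df`; only the `pm2.5` column name is taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy hourly series standing in for train_df; the values are made up.
idx = pd.date_range("2010-01-01 00:00:00", periods=72, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"pm2.5": np.arange(72, dtype=float)}, index=idx)

# Partial-string indexing: select every row of one calendar day.
day_two = toy.loc["2010-01-02"]

# Slice between two timestamps (both endpoints are included).
morning = toy.loc["2010-01-01 06:00":"2010-01-01 09:00"]

# Aggregate the hourly values into daily means.
daily_mean = toy["pm2.5"].resample("D").mean()

print(len(day_two), "rows on 2010-01-02")
print(len(morning), "rows between 06:00 and 09:00")
print(daily_mean)
```

None of these operations is available on a plain integer index, which is why the conversion is done before any further analysis.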