
Hello! 👋 As your data science mentor, I'd be happy to help you with your AQI project. This is an excellent topic for a comparative analysis, and breaking it down into manageable steps is key. I'll provide a step-by-step guide with the necessary Python code, explanations, and guidance on best practices for each stage of your project.

Let's begin with the first and most critical task: **Data Preprocessing**.

-----

### 1. Data Preprocessing 🧹

Before any analysis can be performed, you must prepare your data. This involves loading the data, combining it, cleaning it, and ensuring it is in a usable format.

#### **Guidance**

  * **File Loading and Combining:** The most efficient way to handle multiple CSV files is to load them programmatically in a loop. I will provide code that iterates through your files, adds a `City` column to each, and then combines them into a single pandas `DataFrame`.
  * **Column Standardization:** The column names in your datasets include special characters and units, such as `PM2.5 (µg/m³)` and `AT (°C)`. It is a best practice to standardize these names into a more machine-readable format (e.g., `pm2_5`, `temp_c`).
  * **Handling Missing Values:** Your data snippets show missing values as blank cells or `NA`. Convert these to `NaN` (Not a Number) so that numerical operations treat them as missing. After that, choose an imputation strategy. Given the time-series nature of your data, **forward or backward fill (`ffill()` / `bfill()`)** or **interpolation (`interpolate()`)** are often better choices than simple mean or median imputation, as they preserve the temporal relationships in the data. Note that `fillna(method='ffill')` is deprecated in recent pandas versions. Cap long gaps with the `limit` argument so that a single stale reading is not carried across many time steps.
  * **Data Aggregation:** Your data appears to be recorded on a sub-daily basis. For analysis and modeling, aggregate it to a **daily or hourly level** with pandas' `resample()` method. Group by `City` first so that readings from different cities are not averaged together.
  * **Handling Duplicates:** Always check for and remove duplicate rows to ensure the integrity of your dataset.
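
To make the difference between the imputation options concrete, here is a minimal sketch on made-up hourly readings. The values and the `limit=2` cap are assumptions for illustration, not recommendations for your data.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly PM2.5 readings with a short gap, a long gap, and a trailing gap.
idx = pd.date_range("2024-01-01", periods=10, freq=pd.Timedelta(hours=1))
pm25 = pd.Series(
    [10.0, np.nan, 14.0, np.nan, np.nan, np.nan, np.nan, 30.0, 32.0, np.nan],
    index=idx,
)

# Forward fill repeats the last valid reading; limit=2 caps how far it is carried.
filled_ffill = pm25.ffill(limit=2)

# Time-based interpolation draws a straight line between valid neighbours.
# limit_area="inside" leaves gaps after the last valid reading untouched.
filled_interp = pm25.interpolate(method="time", limit=2, limit_area="inside")

print(pd.DataFrame({"raw": pm25, "ffill": filled_ffill, "interp": filled_interp}))
```

In the long gap, both methods fill only the first two missing hours and leave the rest as `NaN`, which keeps long outages visible instead of hiding them behind invented values.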

#### **Python Code**

Here is the Python code to perform the initial data preprocessing. You can copy and paste this into your Jupyter notebook and run it. **Please note:** If you are running this in a local environment, you may need to adjust the file paths in the `file_list` to where your files are stored. The code below assumes all the files are in the same directory.

In [None]:
import pandas as pd
import glob

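# Step 1: Read all the CSV files and combine them into a single DataFrame.
# glob.glob returns every file ending with .csv in the current directory.
# If your files are in a subdirectory, adjust the pattern (and the city-name logic below).
# Skip the output file written at the end of this cell so re-runs do not treat it as a city.
file_list = [f for f in glob.glob('*.csv') if f != 'cleaned_aqi_data.csv']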

# If glob doesn't work, you can manually list your files as a fallback:
# file_list = [
#     'Anantapur.csv', 'Chittoor.csv', 'Kadapa.csv', 'Rajamahendravaram.csv',
#     'Tirupati.csv', 'Vijayawada.csv', 'Visakhapatnam.csv', 'Amravati.csv'
# ]

# Initialize an empty list to store the DataFrames.
df_list = []

# Loop through each file, read it into a DataFrame, and add a 'City' column.
for file in file_list:
    city_name = file.replace('.csv', '')
    df = pd.read_csv(file)
    df['City'] = city_name
    df_list.append(df)

# Concatenate all the DataFrames into one.
combined_df = pd.concat(df_list, ignore_index=True)

# Step 2: Standardize column names.
new_columns = {
    'PM2.5 (µg/m³)': 'pm2_5',
    'PM10 (µg/m³)': 'pm10',
    'NO (µg/m³)': 'no',
    'NO2 (µg/m³)': 'no2',
    'NOx (ppb)': 'nox',
    'NH3 (µg/m³)': 'nh3',
    'SO2 (µg/m³)': 'so2',
    'CO (mg/m³)': 'co',
    'Ozone (µg/m³)': 'ozone',
    'Benzene (µg/m³)': 'benzene',
    'Toluene (µg/m³)': 'toluene',
    'Xylene (µg/m³)': 'xylene',
    'O Xylene (µg/m³)': 'o_xylene',
    'Eth-Benzene (µg/m³)': 'eth_benzene',
    'MP-Xylene (µg/m³)': 'mp_xylene',
    'AT (°C)': 'temp_c',
    'RH (%)': 'rh_percent',
    'WS (m/s)': 'ws_m_s',
    'WD (deg)': 'wd_deg',
    'RF (mm)': 'rf_mm',
    'TOT-RF (mm)': 'tot_rf_mm',
    'SR (W/mt2)': 'sr_w_mt2',
    'BP (mmHg)': 'bp_mmHg',
    'VWS (m/s)': 'vws_m_s',
    'Timestamp': 'timestamp'
}
combined_df.rename(columns=new_columns, inplace=True)

# Step 3: Handle missing values and convert data types.
# Replace common representations of missing data with pandas' NaN.
combined_df.replace(['', 'NA'], pd.NA, inplace=True)

# Convert 'timestamp' to datetime and set it as the index.
combined_df['timestamp'] = pd.to_datetime(combined_df['timestamp'])
combined_df.set_index('timestamp', inplace=True)

# Convert all relevant columns to numeric type, coercing errors to NaN.
for col in combined_df.columns:
    if col not in ['City']:
        combined_df[col] = pd.to_numeric(combined_df[col], errors='coerce')

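# Step 4: Check for and drop duplicates.
# duplicated() ignores the index, so bring 'timestamp' back as a column first;
# otherwise identical readings at different times would be dropped as duplicates.
combined_df = combined_df.reset_index()
duplicates = combined_df.duplicated().sum()
print(f"Found {duplicates} duplicate rows.")
if duplicates > 0:
    combined_df = combined_df.drop_duplicates()
    print("Dropped duplicate rows.")
combined_df = combined_df.set_index('timestamp')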

# Step 5: Resample the data to a daily average, separately for each city.
# Grouping by 'City' first keeps one city's readings from being averaged with another's.
daily_df = (
    combined_df.groupby('City')
    .resample('D')
    .mean(numeric_only=True)
    .reset_index(level='City')  # keep 'timestamp' as the index, 'City' as a column
)

# You can save this cleaned and resampled data to a new CSV file.
daily_df.to_csv('cleaned_aqi_data.csv')

print("\n--- Preprocessing Complete ---")
print("Head of the cleaned and resampled DataFrame:")
print(daily_df.head())
print("\nDataFrame Info:")
daily_df.info()
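
Daily averaging across several cities only makes sense if each city is aggregated on its own. The following minimal sketch, on made-up data, shows one way to combine `groupby()` with `resample()`; the cities `A` and `B` and their readings are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical 12-hourly readings for two cities over two days.
idx = pd.date_range("2024-01-01", periods=4, freq=pd.Timedelta(hours=12))
raw = pd.concat([
    pd.DataFrame({"City": "A", "pm2_5": [10.0, 20.0, 30.0, np.nan]}, index=idx),
    pd.DataFrame({"City": "B", "pm2_5": [100.0, 120.0, 140.0, 160.0]}, index=idx),
])
raw.index.name = "timestamp"

# Group by city first so each city's readings are averaged separately,
# then resample each group to one row per day. mean() skips NaN values.
daily = (
    raw.groupby("City")["pm2_5"]
    .resample("D")
    .mean()
    .reset_index()
)
print(daily)
```

Each city ends up with one row per day. Resampling the combined frame without grouping would instead blend both cities into a single daily mean.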

-----

### 2. Exploratory Data Analysis (EDA) 📊

Once your data is clean, you can start exploring it to uncover patterns, trends, and anomalies.

#### **Guidance**

  * **Time-Series Plots:** The best visualization for a time-series is a line plot. You can use **`matplotlib`** or **`seaborn`** to plot pollutants like PM2.5 and PM10 over time. Make a separate plot for each city or use a faceted plot to compare them.
  * **Seasonal Decomposition:** To analyze seasonal peaks, you can use `statsmodels.tsa.seasonal.seasonal_decompose`. This will break down your time series into three components: **trend**, **seasonality**, and **residuals**. This is an excellent way to quantify the yearly cycle of air pollution.
  * **Cross-City Boxplots/Heatmaps:** To compare pollution levels across cities, a **boxplot** is an ideal choice. It will show the median, quartiles, and outliers for each city's PM2.5 and PM10 levels. A **heatmap** of a correlation matrix is perfect for visualizing the relationships between pollutants and meteorological drivers.
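
It can help to know exactly which numbers a boxplot encodes before you read one. This minimal sketch computes them with pandas for made-up data; the cities and values are hypothetical, and the 1.5 × IQR rule is the default whisker setting in matplotlib and seaborn.

```python
import pandas as pd

# Hypothetical daily PM2.5 values for two cities.
df = pd.DataFrame({
    "City": ["A"] * 5 + ["B"] * 5,
    "pm2_5": [10.0, 20.0, 30.0, 40.0, 200.0, 50.0, 60.0, 70.0, 80.0, 90.0],
})

# Quartiles per city: the box edges (q1, q3) and the centre line (median).
stats = df.groupby("City")["pm2_5"].quantile([0.25, 0.5, 0.75]).unstack()
stats.columns = ["q1", "median", "q3"]
stats["iqr"] = stats["q3"] - stats["q1"]

# Points more than 1.5 * IQR above the box are drawn as individual outliers.
stats["upper_fence"] = stats["q3"] + 1.5 * stats["iqr"]

# Flag each reading that lies above its own city's upper fence.
df["is_outlier"] = df["pm2_5"] > df["City"].map(stats["upper_fence"])

print(stats)
print(df[df["is_outlier"]])
```

A table like `stats` is also a compact way to report the city-to-city comparison in your write-up alongside the plot.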

#### **Python Code & Visualization Choices**

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.tsa.seasonal import seasonal_decompose

# Re-load the cleaned data if you are starting a new notebook session
# daily_df = pd.read_csv('cleaned_aqi_data.csv', index_col='timestamp', parse_dates=True)

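# --- Visualization 1: Time-series Plot for PM2.5 ---
plt.figure(figsize=(15, 6))
# hue='City' draws one line per city instead of joining all cities into a single line.
sns.lineplot(data=daily_df.reset_index(), x='timestamp', y='pm2_5', hue='City')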
plt.title('PM2.5 Levels Over Time (Daily Average)')
plt.xlabel('Date')
plt.ylabel('PM2.5 (µg/m³)')
plt.grid(True)
plt.show()

# --- Visualization 2: Seasonal Boxplots ---
# Create a 'month' column for seasonal analysis.
daily_df['month'] = daily_df.index.month
plt.figure(figsize=(12, 8))
# reset_index() gives seaborn a unique row index even when several cities share a date.
sns.boxplot(x='month', y='pm2_5', data=daily_df.reset_index())
plt.title('Monthly PM2.5 Levels')
plt.xlabel('Month')
plt.ylabel('PM2.5 (µg/m³)')
plt.grid(True)
plt.show()

# --- Visualization 3: Correlation Heatmap ---
# Select the columns for the heatmap.
pollutants = ['pm2_5', 'pm10', 'no2', 'so2', 'ozone', 'co']
meteorological = ['temp_c', 'rh_percent', 'ws_m_s', 'rf_mm']
all_cols = pollutants + meteorological
correlation_matrix = daily_df[all_cols].corr(method='pearson')
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix of Pollutants and Meteorological Drivers')
plt.show()

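# --- Seasonal Decomposition ---
# Choose a city to decompose.
city_data = daily_df[daily_df['City'] == 'Visakhapatnam']['pm2_5']
# seasonal_decompose needs a regularly spaced series without gaps, so fill
# missing days by time-based interpolation instead of dropping them.
city_data = city_data.asfreq('D').interpolate(method='time').dropna()
# The period is the length of one seasonal cycle: 365 for daily data with yearly
# seasonality. statsmodels requires at least two full cycles (730 observations).
decomposition = seasonal_decompose(city_data, model='additive', period=365)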
decomposition.plot()
plt.suptitle('Seasonal Decomposition of PM2.5 for Visakhapatnam', y=1.02)
plt.show()
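
If you want to see what an additive decomposition does under the hood, the sketch below approximates it with pandas alone on a synthetic series. It is a simplified stand-in for illustration, not the `statsmodels` algorithm, and the trend, cycle, and peak date are all made up.

```python
import numpy as np
import pandas as pd

# Synthetic two-year daily series: slow upward trend plus a yearly cycle
# that peaks in mid-January (values are invented for illustration).
idx = pd.date_range("2022-01-01", "2023-12-31", freq="D")
t = np.arange(len(idx))
cycle = 20 * np.cos(2 * np.pi * (idx.dayofyear - 15) / 365.25)
series = pd.Series(50 + 0.01 * t + cycle, index=idx)

# Trend: centred 365-day rolling mean, which averages out the yearly cycle.
trend = series.rolling(window=365, center=True).mean()

# Seasonal component: average of the detrended values for each calendar month.
detrended = series - trend
seasonal_by_month = detrended.groupby(detrended.index.month).mean()

peak_month = int(seasonal_by_month.idxmax())
trough_month = int(seasonal_by_month.idxmin())
print(seasonal_by_month.round(1))
print(f"Peak month: {peak_month}, trough month: {trough_month}")
```

The rolling mean is undefined for the first and last half-year, which is one reason a decomposition with a yearly period needs at least two full years of data.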

-----

### 3. Analysis and Modeling 📈

This is where you'll quantify the relationships and make predictions.

#### **Guidance**

  * **Correlation:** Use both **Pearson** and **Spearman** correlation coefficients. Pearson measures linear relationships, while Spearman measures monotonic (but not necessarily linear) relationships. This will help you understand if the relationship between pollutants and meteorological drivers is linear or more complex.
  * **Regression Analysis:** A linear regression model is a great starting point to quantify the influence of meteorological factors. The R-squared value is the proportion of the variance in a pollutant that the meteorological variables explain (multiply by 100 for a percentage).
  * **ANN for Prediction:** For your 7-day forecast, an Artificial Neural Network (ANN) is an excellent choice. You will need to **scale your data** using `StandardScaler`, create a supervised learning dataset with lagged features, and then train the model using `scikit-learn` or `keras`.
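
To see why it is worth reporting both coefficients, consider a relationship that always increases but is far from a straight line. In this minimal sketch the exponential relationship is invented for illustration; Spearman is computed as the Pearson correlation of the ranks, which is its definition.

```python
import numpy as np
import pandas as pd

# Hypothetical monotonic but non-linear relationship: y grows exponentially with x.
x = pd.Series(np.linspace(0, 5, 50))
y = np.exp(x)

# Pearson measures how well a straight line describes the relationship.
pearson_r = x.corr(y)

# Spearman is the Pearson correlation of the ranks, so any strictly
# increasing relationship scores 1 regardless of its shape.
spearman_rho = x.rank().corr(y.rank())

print(f"Pearson r = {pearson_r:.2f}, Spearman rho = {spearman_rho:.2f}")
```

A Spearman value clearly above the Pearson value for the same pair of variables is a hint that the relationship is monotonic but curved.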

#### **Python Code**

In [None]:
from scipy.stats import pearsonr, spearmanr
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# --- Part 1: Correlation Analysis ---
# Example for PM2.5 and temperature.
# Drop rows where either value is missing so both series stay aligned and equal in length.
pair = daily_df[['pm2_5', 'temp_c']].dropna()
pm25_temp_pearson = pearsonr(pair['pm2_5'], pair['temp_c'])
pm25_temp_spearman = spearmanr(pair['pm2_5'], pair['temp_c'])
# The .statistic attribute requires SciPy 1.9 or newer.
print(f"Pearson Correlation (PM2.5 vs Temp): r={pm25_temp_pearson.statistic:.2f}, p-value={pm25_temp_pearson.pvalue:.2e}")
print(f"Spearman Correlation (PM2.5 vs Temp): rho={pm25_temp_spearman.statistic:.2f}, p-value={pm25_temp_spearman.pvalue:.2e}")

# --- Part 2: Regression Analysis (PM2.5 vs Meteorological Drivers) ---
# Assuming you have filled the missing values and selected your features
features = ['temp_c', 'rh_percent', 'ws_m_s', 'rf_mm']
target = 'pm2_5'
# Drop rows with any NaN values for the analysis
regression_data = daily_df.dropna(subset=features + [target])
X = regression_data[features]
y = regression_data[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"\nLinear Regression Model Performance:")
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared: {r2:.2f}")  # Proportion of variance explained on the test set (x100 for %)

# --- Part 3: ANN for 7-day Prediction (conceptual outline) ---
# NOTE: This is a high-level template. You will need to build the
# time-series dataset with lagged features for a true forecast.
# For simplicity, this example shows a basic ANN setup.
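# Split first, then fit the scaler on the training data only; fitting it on
# all of X would leak test-set statistics into training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)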

# Build the ANN model
model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(X_train.shape[1],)))
model.add(Dense(32, activation='relu'))
model.add(Dense(1)) # Output layer for a single value prediction

# Compile and train the model
model.compile(optimizer='adam', loss='mean_squared_error')
history = model.fit(X_train, y_train, epochs=50, validation_split=0.2, verbose=0)
print("\nANN Model Training Complete.")
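
The template above still needs a lagged dataset before it can forecast anything. The sketch below shows one way to build it with pandas; `make_supervised` is a hypothetical helper, the lag count is arbitrary, and it predicts only the single value 7 days ahead, matching the one-unit output layer.

```python
import numpy as np
import pandas as pd


def make_supervised(series: pd.Series, n_lags: int = 14, horizon: int = 7) -> pd.DataFrame:
    """Build lag features and a target `horizon` steps ahead (hypothetical helper)."""
    frame = pd.DataFrame(index=series.index)
    for lag in range(n_lags):
        # lag_0 is today's value, lag_1 is yesterday's, and so on.
        frame[f"lag_{lag}"] = series.shift(lag)
    # shift(-horizon) places the value from `horizon` days ahead on today's row.
    frame["target"] = series.shift(-horizon)
    # Rows without a full lag history or a known future value are dropped.
    return frame.dropna()


# Made-up daily series 0, 1, 2, ... so the alignment is easy to check by eye.
idx = pd.date_range("2024-01-01", periods=30, freq="D")
pm25 = pd.Series(np.arange(30, dtype=float), index=idx)
supervised = make_supervised(pm25, n_lags=3, horizon=7)

# Chronological split: train on the earlier rows, test on the most recent ones.
split = int(len(supervised) * 0.8)
train, test = supervised.iloc[:split], supervised.iloc[split:]

print(supervised.head())
print(f"Train rows: {len(train)}, test rows: {len(test)}")
```

For forecasting, a chronological split like this is usually preferable to a shuffled one, because shuffling lets the model train on days that come after the ones it is tested on. Build the lagged frame separately for each city so that lags never cross from one city into another.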

-----

### 4. Interpretation and Deliverables 📝

#### **Guidance**

  * **Report:** Your report should be structured logically. Start with an **Abstract/Objectives**, move to **Methods** (data collection, preprocessing, modeling), then present **Key Findings** (your EDA and analysis results), and conclude with a **Discussion of Limitations** and **Future Work**.
  * **Slide Deck:** For your presentation, focus on one key insight per slide. Use a strong visual and a concise title. For example:
      * **Slide 1: Title** (Project Name, Your Name)
      * **Slide 2: Objectives** (Briefly state your goals)
      * **Slide 3: Data Overview** (Show a table or summary of the data)
      * **Slide 4: Seasonal Trends** (Show a seasonal decomposition plot)
      * **Slide 5: City-to-City Comparison** (Show a boxplot of PM2.5 levels)
      * **Slide 6: Correlation Analysis** (Show the heatmap)
      * **Slide 7: Regression Results** (Display the R-squared value)
      * **Slide 8: Forecasting Model** (Briefly explain your ANN)
      * **Slide 9: Key Findings & Conclusion** (Summarize your main insights)
      * **Slide 10: Q&A** (Open the floor for questions)

You have a solid plan and the tools you need to get started. Remember to document your steps, as this will make writing your final report much easier. Good luck with your project! You've got this! ✨