# **Project 3: Basic Weather Analysis**

In this project, we will analyze a daily weather dataset containing temperature, humidity, wind speed, and mean pressure. The goal is to clean the data, calculate basic statistics, split the dataset into seasons, and export the results to a JSON file. This notebook is designed for learning, so each step will be explained in detail.

---

## **Specific Tasks:**

1. **Download a daily weather dataset (e.g., temperature, humidity, wind speed, mean pressure).**
2. **Clean the data, including:**
   - Converting date columns to the correct format.
   - Detecting and handling outliers in temperature data.
3. **Calculate basic statistics:**
   - Monthly maximum, minimum, and average temperatures.
4. **Split the dataset into seasons (spring, summer, autumn, winter).**
5. **Export the results to a JSON file.**

---

## **Step 1: Load the Dataset**

We’ll start by loading the dataset into a Pandas DataFrame.

In [1]:
import pandas as pd

# Load the dataset
df = pd.read_csv("DailyDelhiClimate.csv")

# Display the first few rows of the dataset
print("First 5 rows of the dataset:")
df.head()

First 5 rows of the dataset:


Unnamed: 0,date,meantemp,humidity,wind_speed,meanpressure
0,2017-01-01,15.913043,85.869565,2.743478,59.0
1,2017-01-02,18.5,77.222222,2.894444,1018.277778
2,2017-01-03,17.111111,81.888889,4.016667,1018.333333
3,2017-01-04,18.7,70.05,4.545,1015.7
4,2017-01-05,18.388889,74.944444,3.3,1014.333333


### **Explanation:**
- **`pd.read_csv()`**: This function reads the CSV file and loads it into a Pandas DataFrame. The dataset contains columns like `date`, `meantemp`, `humidity`, `wind_speed`, and `meanpressure`.
- **`df.head()`**: This displays the first 5 rows of the dataset, giving us a quick overview of the data structure.
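
Beyond `df.head()`, it can help to check the shape, column types, and missing values right after loading. The following is a minimal sketch, not part of the original notebook: it reads a small inline sample (copied from the first rows shown above) through `io.StringIO` instead of the real CSV file, and the variable names are illustrative assumptions.

```python
import io

import pandas as pd

# Inline sample mirroring the first rows shown above (stands in for the real file)
csv_text = """date,meantemp,humidity,wind_speed,meanpressure
2017-01-01,15.913043,85.869565,2.743478,59.0
2017-01-02,18.5,77.222222,2.894444,1018.277778
2017-01-03,17.111111,81.888889,4.016667,1018.333333
"""

sample = pd.read_csv(io.StringIO(csv_text))

# Basic structural checks
n_rows, n_cols = sample.shape
missing_per_column = sample.isna().sum()

# read_csv does not parse dates unless asked to, so 'date' is still text here
date_is_datetime = pd.api.types.is_datetime64_any_dtype(sample["date"])

print(sample.dtypes)
print("Rows:", n_rows, "Columns:", n_cols)
print("Missing values per column:")
print(missing_per_column)
print("Is 'date' already a datetime column?", date_is_datetime)
```

The same checks can be run on `df` itself; the inline sample only keeps the sketch runnable without the dataset file.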

---

## **Step 2: Clean the Data**

### **2.1. Convert Date Column to the Correct Format**

The `date` column is currently in a string format. To make it easier to work with, we’ll convert it to a datetime format.

In [2]:
# Convert the 'date' column to datetime format
df['date'] = pd.to_datetime(df['date'])

# Display the updated DataFrame
print("DataFrame after converting 'date' column to datetime:")
df.head()

DataFrame after converting 'date' column to datetime:


Unnamed: 0,date,meantemp,humidity,wind_speed,meanpressure
0,2017-01-01,15.913043,85.869565,2.743478,59.0
1,2017-01-02,18.5,77.222222,2.894444,1018.277778
2,2017-01-03,17.111111,81.888889,4.016667,1018.333333
3,2017-01-04,18.7,70.05,4.545,1015.7
4,2017-01-05,18.388889,74.944444,3.3,1014.333333


### **Explanation:**
- **`pd.to_datetime()`**: This function converts the `date` column from a string format (e.g., "2017-01-01") to a datetime format. This allows us to perform date-based operations, such as extracting the month or year.
- After conversion, the `date` column will be in a format that Pandas recognizes as dates, making it easier to filter or group data by date.
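
As a small illustration of what the conversion enables, the sketch below parses a few hypothetical date strings and then reads the month and year through the `.dt` accessor. The `errors="coerce"` and `format` arguments are an optional addition that the notebook itself does not use; they turn unparseable values into `NaT` instead of raising an error.

```python
import pandas as pd

# Hypothetical raw values, including one that is not a valid date
raw_dates = pd.Series(["2017-01-01", "2017-02-15", "not a date"])

# Invalid entries become NaT (pandas' missing value for datetimes)
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")

# Date-based attributes are available through the .dt accessor
months = parsed.dt.month
years = parsed.dt.year
n_invalid = int(parsed.isna().sum())

print(parsed)
print("Unparseable dates:", n_invalid)
```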

---

### **2.2. Detect and Handle Outliers in Temperature Data**

Outliers are data points that are significantly different from the rest of the data. They can skew our analysis, so we’ll detect and handle them using the **Interquartile Range (IQR)** method.

In [3]:
# Calculate the IQR for the 'meantemp' column
Q1 = df['meantemp'].quantile(0.25)  # First quartile (25th percentile)
Q3 = df['meantemp'].quantile(0.75)  # Third quartile (75th percentile)
IQR = Q3 - Q1  # Interquartile range

# Define the lower and upper bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

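# Filter out outliers; .copy() gives an independent DataFrame, which avoids
# pandas' SettingWithCopyWarning when new columns are added in later steps
df = df[(df['meantemp'] >= lower_bound) & (df['meantemp'] <= upper_bound)].copy()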

# Display the DataFrame after removing outliers
print("DataFrame after removing temperature outliers:")
df.head()

DataFrame after removing temperature outliers:


Unnamed: 0,date,meantemp,humidity,wind_speed,meanpressure
0,2017-01-01,15.913043,85.869565,2.743478,59.0
1,2017-01-02,18.5,77.222222,2.894444,1018.277778
2,2017-01-03,17.111111,81.888889,4.016667,1018.333333
3,2017-01-04,18.7,70.05,4.545,1015.7
4,2017-01-05,18.388889,74.944444,3.3,1014.333333


### **Explanation:**
- **IQR Method**: The IQR is the range between the first quartile (25th percentile) and the third quartile (75th percentile). Outliers are defined as data points that fall below `Q1 - 1.5 * IQR` or above `Q3 + 1.5 * IQR`.
- **Steps**:
  1. Calculate `Q1` and `Q3` for the `meantemp` column.
  2. Compute the IQR as `Q3 - Q1`.
  3. Define the lower and upper bounds for outliers.
  4. Filter the DataFrame to remove rows where the `meantemp` falls outside these bounds.
- This ensures that extreme values (outliers) do not distort our analysis.
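
To make the arithmetic concrete, here is a worked sketch of the IQR steps above on a small made-up series (the values are hypothetical, not taken from the dataset). It also shows clipping as an alternative way of handling outliers; the notebook itself removes the rows instead.

```python
import pandas as pd

# Hypothetical temperatures with one obviously extreme reading (100.0)
temps = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0, 100.0])

q1 = temps.quantile(0.25)      # 14.0
q3 = temps.quantile(0.75)      # 22.0
iqr = q3 - q1                  # 8.0

lower_bound = q1 - 1.5 * iqr   # 2.0
upper_bound = q3 + 1.5 * iqr   # 34.0

# Flag values outside the bounds instead of dropping them immediately
is_outlier = (temps < lower_bound) | (temps > upper_bound)
outliers = temps[is_outlier].tolist()

# Alternative handling: cap extreme values at the bounds
clipped = temps.clip(lower=lower_bound, upper=upper_bound)

print("Bounds:", lower_bound, upper_bound)
print("Outliers:", outliers)
```

Flagging first makes it easy to inspect the affected rows before deciding whether to drop or cap them.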

---

## **Step 3: Calculate Basic Statistics**

### **3.1. Monthly Maximum, Minimum, and Average Temperatures**

We’ll calculate the monthly maximum, minimum, and average temperatures to understand how the temperature varies throughout the year.

In [4]:
# Extract the month from the 'date' column
df['month'] = df['date'].dt.month

# Group by month and calculate statistics
monthly_stats = df.groupby('month')['meantemp'].agg(['max', 'min', 'mean']).reset_index()

# Rename columns for clarity
monthly_stats.columns = ['Month', 'Max Temp', 'Min Temp', 'Avg Temp']

# Display the monthly statistics
print("Monthly temperature statistics:")
monthly_stats

Monthly temperature statistics:


Unnamed: 0,Month,Max Temp,Min Temp,Avg Temp
0,1,21.0,11.0,15.710873
1,2,23.375,14.666667,18.349981
2,3,31.0,17.375,23.75376
3,4,34.5,25.625,30.753663


### **Explanation:**
- **`df['date'].dt.month`**: This extracts the month from the `date` column. For example, if the date is "2017-01-01", this will return `1` (January).
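- **`groupby('month')`**: This groups the data by month number, allowing us to calculate statistics for each month. If the data spanned several years, the same calendar month from different years would be pooled together.
- **`agg(['max', 'min', 'mean'])`**: This calculates the maximum, minimum, and average temperatures for each month.
- **`reset_index()`**: After grouping, `month` is the index of the result. This call moves it back into a regular column and restores a default integer index, which makes renaming and exporting easier.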
- The resulting `monthly_stats` DataFrame contains the monthly temperature statistics.
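
The grouping logic can be checked by hand on a tiny example. The sketch below uses hypothetical values and pandas' named aggregation, which sets the output column names inside `agg()` instead of renaming them afterwards; this is an alternative spelling, not the approach used in the cell above.

```python
import pandas as pd

# Hypothetical sample: two days in January, two in February
sample = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-01", "2017-01-02", "2017-02-01", "2017-02-02"]),
    "meantemp": [15.0, 19.0, 20.0, 24.0],
})
sample["month"] = sample["date"].dt.month

# Named aggregation: output_column="aggregation function"
stats = (
    sample.groupby("month")["meantemp"]
    .agg(max_temp="max", min_temp="min", avg_temp="mean")
    .reset_index()
)

print(stats)
```

For January the result is a maximum of 19.0, a minimum of 15.0, and a mean of 17.0, which matches a manual calculation.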

---

## **Step 4: Split the Dataset into Seasons**

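We’ll label each row with a season (spring, summer, autumn, winter) based on its month, using the Northern Hemisphere meteorological convention (March to May is spring, and so on). Rows can then be filtered or grouped by season.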

In [5]:
# Define a function to map months to seasons
def get_season(month):
    if month in [3, 4, 5]:
        return 'Spring'
    elif month in [6, 7, 8]:
        return 'Summer'
    elif month in [9, 10, 11]:
        return 'Autumn'
    else:
        return 'Winter'

# Apply the function to create a 'season' column
df['season'] = df['month'].apply(get_season)

# Display the DataFrame with the new 'season' column
print("DataFrame with 'season' column:")
df.head()

DataFrame with 'season' column:


Unnamed: 0,date,meantemp,humidity,wind_speed,meanpressure,month,season
0,2017-01-01,15.913043,85.869565,2.743478,59.0,1,Winter
1,2017-01-02,18.5,77.222222,2.894444,1018.277778,1,Winter
2,2017-01-03,17.111111,81.888889,4.016667,1018.333333,1,Winter
3,2017-01-04,18.7,70.05,4.545,1015.7,1,Winter
4,2017-01-05,18.388889,74.944444,3.3,1014.333333,1,Winter


### **Explanation:**
- **`get_season()` Function**: This function maps each month to a season:
  - **Spring**: March (3), April (4), May (5)
  - **Summer**: June (6), July (7), August (8)
  - **Autumn**: September (9), October (10), November (11)
  - **Winter**: December (12), January (1), February (2)
- **`df['month'].apply(get_season)`**: This applies the `get_season()` function to each value in the `month` column and stores the results in a new `season` column.
- The resulting DataFrame now includes a `season` column, which we can use for seasonal analysis.
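
The same month-to-season mapping can also be written as a dictionary lookup with `Series.map`, and the labelled rows can then be split into one DataFrame per season. This is a sketch under assumptions: `SEASON_BY_MONTH` and `season_frames` are illustrative names, and the temperatures are made up.

```python
import pandas as pd

# Same mapping as get_season(), expressed as a lookup table
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Autumn", 10: "Autumn", 11: "Autumn",
}

# Hypothetical rows covering all four seasons
sample = pd.DataFrame({
    "month": [1, 4, 7, 10, 12],
    "meantemp": [15.9, 30.1, 33.4, 27.2, 16.5],
})

# Label each row by looking up its month in the table
sample["season"] = sample["month"].map(SEASON_BY_MONTH)

# Split into one DataFrame per season, keyed by season name
season_frames = {name: group for name, group in sample.groupby("season")}

print(sample)
print("Winter rows:", len(season_frames["Winter"]))
```

A dictionary keeps the mapping in one place, and any month missing from it would show up as `NaN` rather than silently falling into the `else` branch.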

---

## **Step 5: Export Results to a JSON File**

Finally, we’ll export the cleaned dataset and the monthly statistics to JSON files for further use or sharing.

In [6]:
# Export the cleaned dataset to a JSON file
df.to_json('cleaned_weather_data.json', orient='records', lines=True)

# Export the monthly statistics to a JSON file
monthly_stats.to_json('monthly_temperature_stats.json', orient='records', lines=True)

print("Results saved to 'cleaned_weather_data.json' and 'monthly_temperature_stats.json'")

Results saved to 'cleaned_weather_data.json' and 'monthly_temperature_stats.json'


### **Explanation:**
- **`to_json()`**: This function exports the DataFrame to a JSON file.
- **`orient='records'`**: This ensures the data is exported in a record-oriented format, where each row becomes a JSON object.
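- **`lines=True`**: This writes each record on its own line (the JSON Lines format), which makes large files easy to process line by line. The file as a whole is not a single JSON array, so it should be read back with `pd.read_json(..., lines=True)` or parsed one line at a time.
- **Dates**: Unless `date_format='iso'` is passed, `to_json()` by default writes datetime columns as epoch timestamps in milliseconds rather than readable date strings.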
- The cleaned dataset is saved as `cleaned_weather_data.json`, and the monthly statistics are saved as `monthly_temperature_stats.json`.
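
To see what the exported format looks like, the sketch below serializes a small hypothetical DataFrame to a string instead of a file and reads it back. Passing `date_format="iso"` is an optional variation on the export above that keeps dates human-readable; the sample values are made up.

```python
import io
import json

import pandas as pd

# Hypothetical two-row sample with a datetime column
sample = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-01", "2017-01-02"]),
    "meantemp": [15.9, 18.5],
})

# With no path given, to_json() returns the JSON text instead of writing a file
text = sample.to_json(orient="records", lines=True, date_format="iso")

# Each line is a standalone JSON object
lines = text.splitlines()
records = [json.loads(line) for line in lines]

# Reading JSON Lines back into a DataFrame
loaded = pd.read_json(io.StringIO(text), lines=True)

print(text)
print(loaded)
```

Because every line parses on its own, such a file can be streamed record by record without loading it all into memory.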

---

## **Conclusion**

In this project, we:
1. Loaded a daily weather dataset into a Pandas DataFrame.
2. Converted the `date` column to a datetime format and handled temperature outliers.
3. Calculated monthly temperature statistics.
4. Split the dataset into seasons.
5. Exported the results to JSON files.

The cleaned dataset and analysis results are now ready for further use or reporting.



*Dataset: https://www.kaggle.com/datasets/mahirkukreja/delhi-weather-data*