**Author:** Shahab Fatemi

**Email:** shahab.fatemi@umu.se   ;   shahab.fatemi@amitiscode.com

**Created:** 2024-11-xx

**Last update:** 2025-08-24

**MIT License** — Shahab Fatemi (2025); For use in the *Machine Learning in Physics* course, Umeå University, Sweden; See the full license text in the parent folder.

<hr>

# Practical Data Analysis

In this notebook, we work with the data we have collected and perform some analysis on it. This involves cleaning the data, exploring it visually, and applying some statistical methods to gain insights. We keep everything simple and focus on the key aspects of the data.

***

We aim to analyze the weather data collected for Chicago, Illinois, over a period of nine years, from January 1, 2015, to December 31, 2023. This dataset provides information on temperature trends, precipitation patterns, wind conditions, and more.

### Data Source

I collected this data from [Meteostat](https://meteostat.net).

### Dataset Overview

The dataset consists of the following columns:

| Column | Description                       |
|--------|---------------------------------  |
| `date` | Date of the recorded weather data |
| `tavg` | Average Temperature (°C)          |
| `tmin` | Minimum Temperature (°C)          |
| `tmax` | Maximum Temperature (°C)          |
| `prcp` | Total Precipitation (mm)          |
| `snow` | Snow Depth (cm)                   |
| `wdir` | Wind Direction (degrees)          |
| `wspd` | Wind Speed (km/h)                 |
| `pres` | Air Pressure (hPa)                |
| `weather` | Weather Condition (descriptive)|

### Tools and Methods

To conduct this analysis, we will utilize various tools and libraries, including:

- **Pandas**: To handle and analyze the dataset.
- **Matplotlib and Seaborn**: For data visualization.

We did not explicitly discuss `Pandas` earlier. However, through this practical project, we will also learn the basics of `Pandas`.

### Main Steps

- **Data Loading**: Load the dataset into a Pandas DataFrame.
- **Data Cleaning**: Perform necessary data cleaning steps, such as handling missing values and converting data types.
- **Exploratory Data Analysis (EDA)**: Conduct an EDA to understand the dataset better and derive preliminary insights.
- **Visualization**: Create visualizations to communicate findings effectively.
- **Conclusion**: Summarize the analysis and discuss potential future work based on the results.

Let's get started!
***

### Data Loading

In this step, we will load the weather dataset into a Pandas DataFrame for analysis.

`Pandas` is a very important library in Python, designed to provide flexible and powerful data structures that facilitate working with structured data, such as time series and tabular data. One of its primary features is the `DataFrame`, a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure that organizes data into rows and columns. Each column can contain different data types, and both rows and columns are labeled, making it easy to access and manipulate data. Pandas is widely used in data science, machine learning, and statistical analysis.
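To make the idea of a `DataFrame` concrete, here is a minimal sketch that builds one by hand. The rows are made up for illustration and are not taken from the Chicago dataset; the name `toy` is hypothetical.

```python
import pandas as pd

# Made-up rows for illustration only; these are not values from the Chicago dataset.
toy = pd.DataFrame({
    "date": ["2015-01-01", "2015-01-02", "2015-01-03"],  # text
    "tavg": [-5.1, -3.4, 0.2],                           # floating-point numbers
    "weather": ["Snowy", "Snowy", "Misty"],              # text
})

print(toy.shape)             # (number of rows, number of columns)
print(toy.dtypes)            # one data type per column
print(toy.columns.tolist())  # the column labels
```

Each column keeps its own data type, which is what "heterogeneous" means in the description above.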

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the weather dataset from a CSV file and ignore commented lines denoted by '#' in the file.
weather_data = pd.read_csv('../datasets/Chicago_weather.csv', comment='#')

### Data Viewing

In [None]:
# Displays the first 5 rows of the data
weather_data.head()

# OR
weather_data.head(5)

In [None]:
# Displays the last 5 rows
weather_data.tail()

Next, we use `info()` to get a concise summary of the DataFrame, including the number of non-null entries, the data type of each column, and the memory usage. This is useful for understanding the structure of the data, especially with regard to missing values.

In [None]:
weather_data.info()

⚠️ We immediately notice that the `date` and `weather` columns have the data type "object". We will come back to this issue later.
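For the `date` column, one possible remedy is to parse the text into proper datetime values with `pd.to_datetime`. The sketch below assumes the dates follow the `YYYY-MM-DD` layout; the `sample` DataFrame and the `date_parsed` column name are made up for illustration. The parsed values go into a separate column, so any code that expects the original strings keeps working.

```python
import pandas as pd

# Made-up date strings, assumed to follow the YYYY-MM-DD layout.
sample = pd.DataFrame({"date": ["2015-01-01", "2015-06-15", "2023-12-31"]})

# Parse the strings into datetime values; an explicit format avoids guessing.
sample["date_parsed"] = pd.to_datetime(sample["date"], format="%Y-%m-%d")

print(sample.dtypes)
print(sample["date_parsed"].dt.year.tolist())   # year of each date
print(sample["date_parsed"].dt.month.tolist())  # month of each date
```

Once a column holds datetime values, the `.dt` accessor gives direct access to parts such as the year or the month.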

Now, we use `describe()` to generate descriptive statistics for numeric columns, including count, mean, standard deviation, min, max, and quartiles. This helps in getting a quick statistical overview of the dataset.

In [None]:
weather_data.describe()

### Accessing Data

You can access a specific column by its label.

In [None]:
avg_temperature = weather_data['tavg']  # Access the 'tavg' column

avg_temperature.info()

You can access specific rows using `loc` (label-based) or `iloc` (integer position-based) indexing.

In [None]:
first_row = weather_data.loc[0]  # Access the row labeled 0 (the first row with the default index)
first_row_by_index = weather_data.iloc[0]  # Also accesses the first row

print(first_row)
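With the default index, `loc[0]` and `iloc[0]` happen to return the same row. The following sketch, using a made-up DataFrame with a non-default index, shows where the two differ.

```python
import pandas as pd

# Made-up values with a non-default index, so labels and positions differ.
toy = pd.DataFrame({"tavg": [1.5, 2.5, 3.5]}, index=[10, 20, 30])

by_label = toy.loc[10]      # row whose index label is 10
by_position = toy.iloc[0]   # row at integer position 0 (the same row here)
last_row = toy.iloc[-1]     # negative positions count from the end

print(by_label["tavg"], by_position["tavg"], last_row["tavg"])

# toy.loc[0] would raise a KeyError, because no row carries the label 0.
```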

### Data Filtering

You can filter the DataFrame based on conditions. For example, to find the rows where the average temperature is above 30 °C:

In [None]:
hot_days = weather_data[weather_data['tavg'] > 30]

hot_days.describe()
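Several conditions can be combined in one filter. The sketch below uses made-up values; the column names match the dataset, but the thresholds are arbitrary.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "tavg": [31.0, 12.0, 28.5, -4.0],
    "prcp": [0.0, 5.2, 0.0, 1.1],
})

# Each condition needs its own parentheses; use & (and), | (or), ~ (not).
warm_and_dry = toy[(toy["tavg"] > 25) & (toy["prcp"] == 0)]
cold_or_wet = toy[(toy["tavg"] < 0) | (toy["prcp"] > 5)]

# DataFrame.query expresses the first filter as a string.
warm_and_dry_q = toy.query("tavg > 25 and prcp == 0")

print(warm_and_dry)
print(cold_or_wet)
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators in Python.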

### Modifying the DataFrame

You can create new columns based on existing data. For example, create a column for the temperature in Fahrenheit:

In [None]:
weather_data['tavg_f'] = (weather_data['tavg'] * 9/5) + 32

# Summarize the updated DataFrame; the new column appears as the last column.
weather_data.describe()

You can remove unwanted columns using the `drop()` method:

In [None]:
weather_data = weather_data.drop(columns=['tavg_f'])  # Drop our newly made 'tavg_f' column

weather_data.describe()

### Data Cleaning

We noted earlier that the `weather` column has the data type "object". Data analysis and Machine Learning algorithms work with numbers, so any non-numeric value should be converted to a number.

To do this, we first need to find the distinct values in the `weather` column and how often each one occurs. One way is through:

In [None]:
weather_data['weather'].value_counts()

Now, we know there are 4 unique weather conditions: 'Sunny', 'Rainy', 'Snowy', and 'Misty' in our dataset. We can map these conditions to numerical values. One way to do this is by using the `factorize` function from pandas. In this method, each unique category in `weather` is assigned a unique integer. The `factorize` function also returns the unique values found.

In [None]:
weather_data['weather_encoded'], weather_categories = pd.factorize(weather_data['weather'])

print("Unique categories:", weather_categories)

weather_data['weather_encoded'].value_counts()
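The sketch below shows what `pd.factorize` returns on a small, made-up series, including how it treats a missing entry.

```python
import numpy as np
import pandas as pd

# Made-up labels for illustration; np.nan stands for a missing entry.
conditions = pd.Series(["Sunny", "Rainy", "Sunny", np.nan, "Snowy"])

codes, categories = pd.factorize(conditions)

print(codes)        # one integer code per row, numbered in order of first appearance
print(categories)   # the distinct labels; position in this list equals the code

# Missing entries receive the code -1 by default instead of a category of their own.
# Decode the codes back to their labels, leaving missing entries as None:
decoded = [categories[c] if c >= 0 else None for c in codes]
print(decoded)
```

Keep in mind that these integer codes are arbitrary labels. They do not imply any order among the weather conditions.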

In case you did not notice earlier, our dataset contains missing values (NaNs). You could see this when we explored the dataset using `info()`, where the non-null count of some columns is smaller than the number of rows. We should handle these missing values before proceeding with data analysis. Depending on the purpose of the analysis, we might choose to fill these missing values or drop them. A common approach is to fill the missing values with a specific value, such as the mean or median of the column. Let's fill the missing values with the mean of each column:

In [None]:
# Fill missing values with the mean of each column for numerical features only.
weather_data = weather_data.fillna(weather_data.mean(numeric_only=True))

weather_data.info()

⚠️ Depending on the purpose of the analysis, you might need to keep the original data and work on a copy. This way, you can always refer back to the original dataset if needed. To keep things simple here, I work with the modified dataset directly. Keep in mind, however, that this approach may not be suitable for all situations, especially if you need to compare the results with the original data.
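As a sketch of the alternatives mentioned above, the following example counts the missing values, fills them on an independent copy, and shows the drop-based option. The data and the names `raw`, `cleaned`, and `dropped` are made up for illustration; the median is used here only to show a second filling strategy.

```python
import numpy as np
import pandas as pd

# Made-up values with gaps, for illustration only.
raw = pd.DataFrame({
    "tavg": [1.0, np.nan, 3.0, 100.0],
    "prcp": [0.0, 2.0, np.nan, 4.0],
})

print(raw.isna().sum())   # number of missing values per column

cleaned = raw.copy()      # independent copy; `raw` stays untouched
# The median is less sensitive to outliers (such as 100.0) than the mean.
cleaned = cleaned.fillna(cleaned.median(numeric_only=True))

dropped = raw.dropna()    # alternative: discard every row that has a gap

print(cleaned)
print(dropped)
```

Which option fits best depends on how many values are missing and on what the later analysis needs.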

### Data Augmentation

We can augment our dataset by deriving new features (columns) from the existing ones. This can be useful in Machine Learning, particularly when we have limited data. Let's add an extra column to the data that contains the season:

In [None]:
# Create a new column based on the month extracted from the 'date' column.
# Not an accurate representation of the seasons!
def get_season(date_str):
    month = int(date_str.split('-')[1])
    if month in [12, 1, 2]:
        return 4   # Winter
    elif month in [3, 4, 5]:
        return 1   # Spring
    elif month in [6, 7, 8]:
        return 2   # Summer
    else:
        return 3   # Autumn

# Create a new column in the DataFrame by applying the get_season function to the 'date' column
weather_data['season'] = weather_data['date'].apply(get_season)

weather_data['season'].value_counts()

Note that the season function is not accurate, as it does not account for the actual start and end dates of each season. I wanted to keep the function simple, but for a more accurate representation, consider using the actual seasonal dates.
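One possible refinement is sketched below. It assumes fixed, approximate Northern Hemisphere boundary dates (March 20, June 21, September 22, December 21); the true dates shift slightly from year to year. The function name `get_season_by_date` and the sample dates are made up for illustration, and the season codes follow the same convention as `get_season`.

```python
import pandas as pd

def get_season_by_date(date_str):
    """Return 1=Spring, 2=Summer, 3=Autumn, 4=Winter for a 'YYYY-MM-DD' string.

    Assumes fixed, approximate boundary dates (Mar 20, Jun 21, Sep 22, Dec 21).
    """
    date = pd.to_datetime(date_str, format="%Y-%m-%d")
    month_day = (date.month, date.day)   # tuples compare element by element
    if (3, 20) <= month_day < (6, 21):
        return 1   # Spring
    elif (6, 21) <= month_day < (9, 22):
        return 2   # Summer
    elif (9, 22) <= month_day < (12, 21):
        return 3   # Autumn
    else:
        return 4   # Winter

# Made-up dates for illustration only.
sample_dates = pd.Series(["2015-03-19", "2015-03-20", "2015-07-04", "2015-12-25"])
print(sample_dates.apply(get_season_by_date).tolist())   # prints [4, 1, 2, 4]
```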

### Data Visualization

Let's visualize the entire dataset:

In [None]:
weather_data.hist(bins=50, figsize=(12, 8))
plt.tight_layout()

# To change the figure dpi, one option is to set it when saving the figure:
## plt.savefig("weather_hist.png", dpi=200)

In the code below, we plot the distribution of average temperatures in the dataset.

In [None]:
weather_data['tavg'].hist(bins=50, 
                          figsize=(6, 4), 
                          color='skyblue', 
                          edgecolor='k')
plt.xlabel("Average Temperature (°C)")
plt.ylabel("Frequency")
plt.title("Distribution of Average Temperature")
plt.grid(axis='y', alpha=0.75)
plt.tight_layout()

Another way to visualize the distribution of average temperatures is a Seaborn histogram with a Kernel Density Estimate (KDE) curve overlaid.
I, *personally*, prefer Seaborn over the plotting functionality built into Pandas.

In [None]:
import seaborn as sns
plt.figure(figsize=(4, 3), dpi=200)
sns.histplot(weather_data['tavg'], bins=50, color="forestgreen", kde=True)
plt.grid(True, linestyle="--", alpha=0.7)
plt.show()

### Correlation Analysis

We want to understand the relationships between different weather variables. Correlation analysis helps us identify which variables are positively or negatively correlated.

We can visualize the pairwise relationships between the numeric columns using Pandas' `scatter_matrix` function.

In [None]:
pd.plotting.scatter_matrix(weather_data, figsize=(12, 12))
plt.show()

Or, we can use Seaborn:

In [None]:
# Compute correlation matrix (only numeric columns)
corr = weather_data.corr(numeric_only=True)

# Show heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr, 
            annot=True, fmt=".2f", 
            vmax=1.0, vmin=-1.0, 
            linewidths=0.1,
            cmap="coolwarm", 
            square=True, cbar=True)
plt.title("Correlation Matrix")
plt.tight_layout()
plt.show()

There is a lot to see in the heatmap above. We can also show only the pairs that are strongly correlated:

In [None]:
# Show heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr[(corr >= 0.5) | (corr <= -0.3)],  # Just for example. We know the thresholds can be adjusted.
            annot=True, fmt=".2f", 
            vmax=1.0, vmin=-1.0, 
            linewidths=0.1,
            cmap="coolwarm", 
            square=True, cbar=True)
plt.title("Correlation Matrix")
plt.tight_layout()
plt.show()
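The strongly correlated pairs can also be listed as a table instead of a plot. The sketch below uses a made-up DataFrame (`toy`) and an arbitrary threshold of 0.5 on the absolute correlation; both are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up columns: b rises with a, c falls with a, d is nearly unrelated to the rest.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.8],
    "c": [5.0, 4.1, 2.9, 2.2, 1.0],
    "d": [1.0, -1.0, 0.0, -1.0, 1.0],
})
toy_corr = toy.corr()

# Keep only the upper triangle (k=1 excludes the diagonal), so each pair appears once.
upper = np.triu(np.ones(toy_corr.shape, dtype=bool), k=1)
pairs = toy_corr.where(upper).stack()

# Select the pairs whose absolute correlation reaches the (arbitrary) threshold,
# strongest first. Masked entries are NaN and never pass the comparison.
strong = pairs[pairs.abs() >= 0.5].sort_values(key=lambda s: s.abs(), ascending=False)
print(strong)
```

Using the absolute value treats positive and negative correlations symmetrically, unlike the two separate thresholds in the heatmap above.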

Let's visualize the correlation between temperature and pressure.

In [None]:
sns.regplot(data=weather_data, 
            x="tavg", 
            y="pres", 
            #color='forestgreen',
            ci=90,  # Confidence interval for the regression line
            line_kws={'color': 'darkred'})
plt.title("Temperature vs. Pressure")
plt.grid(True)
plt.show()
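The regression line in such a plot can also be summarized numerically. The sketch below computes the Pearson correlation coefficient and a least-squares straight line with NumPy on made-up temperature and pressure values; the numbers are not taken from the Chicago dataset.

```python
import numpy as np
import pandas as pd

# Made-up temperature (°C) and pressure (hPa) values, for illustration only.
toy = pd.DataFrame({
    "tavg": [-5.0, 0.0, 10.0, 20.0, 25.0],
    "pres": [1025.0, 1021.0, 1017.0, 1013.0, 1012.0],
})

# Pearson correlation coefficient between the two columns.
r = toy["tavg"].corr(toy["pres"])

# Least-squares straight line: pres ≈ slope * tavg + intercept.
slope, intercept = np.polyfit(toy["tavg"], toy["pres"], deg=1)

print(f"r = {r:.3f}, slope = {slope:.3f} hPa/°C, intercept = {intercept:.1f} hPa")
```

A negative slope together with a negative `r` means that, in this toy data, pressure tends to fall as temperature rises.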

You can do a lot more, and you will learn the rest throughout the MLP course.

***
END
***