_**DELETE BEFORE PUBLISHING**_

_This is a template also containing the style guide for use cases. The styling uses the use-case css when uploaded to the website, which will not be visible on your local machine._

_Change any text marked with {} and delete any cells marked DELETE_

***

In [1]:
# DELETE BEFORE PUBLISHING
# This is just here so you can preview the styling on your local machine

from IPython.core.display import HTML
HTML("""
<style>
.usecase-title, .usecase-duration, .usecase-section-header {
    padding-left: 15px;
    padding-bottom: 10px;
    padding-top: 10px;
    padding-right: 15px;
    background-color: #0f9295;
    color: #fff;
}

.usecase-title {
    font-size: 1.7em;
    font-weight: bold;
}

.usecase-authors, .usecase-level, .usecase-skill {
    padding-left: 15px;
    padding-bottom: 7px;
    padding-top: 7px;
    background-color: #baeaeb;
    font-size: 1.4em;
    color: #121212;
}

.usecase-level-skill  {
    display: flex;
}

.usecase-level, .usecase-skill {
    width: 50%;
}

.usecase-duration, .usecase-skill {
    text-align: right;
    padding-right: 15px;
    padding-bottom: 8px;
    font-size: 1.4em;
}

.usecase-section-header {
    font-weight: bold;
    font-size: 1.5em;
}

.usecase-subsection-header, .usecase-subsection-blurb {
    font-weight: bold;
    font-size: 1.2em;
    color: #121212;
}

.usecase-subsection-blurb {
    font-size: 1em;
    font-style: italic;
}
</style>
""")

<div class="usecase-title"><b>Weather Condition Classification</b></div>

<div class="usecase-authors"><b>Authored by: </b>Aremu Akintomiwa James</div>

<div class="usecase-duration"><b>Duration:</b> {90} mins</div>

<div class="usecase-level-skill">
    <div class="usecase-level"><b>Level: </b>{Intermediate}</div>
    <div class="usecase-skill"><b>Pre-requisite Skills: </b>Python, Data analysis, Machine Learning, Basic Meteorology</div>
</div>

<div class="usecase-section-header"><b>Scenario</b></div>

 As an urban planner or agricultural manager, I need to accurately classify different weather conditions using environmental features to determine the optimal times for infrastructure projects and agricultural activities. This will ensure that operations are conducted under favorable weather conditions, thereby providing actionable insights for planning and decision-making.



<div class="usecase-section-header"><b>What this use case will teach you</b></div>

At the end of this use case you will:
- Understand how to preprocess and analyze environmental data.
- Learn how to build and evaluate a machine learning model for classification tasks.
- Gain experience in feature selection and engineering for weather-related datasets.
- Develop skills in using Python libraries such as Pandas, Scikit-learn, and Matplotlib.
- Understand the importance of accurate weather classification for planning and decision-making in various sectors.


<div class="usecase-section-header"><b> introduction</b></div>

In this use case, we aim to develop a robust machine learning model capable of accurately classifying various weather conditions such as sunny, cloudy, rainy, and stormy using environmental features. These features include ambient air temperature, relative humidity, atmospheric pressure, wind speed and direction, and gust wind speed. Accurate weather classification is important for optimizing the timing of infrastructure projects and agricultural activities, ensuring that operations are conducted under favorable weather conditions. By leveraging machine learning techniques, we can provide actionable insights for planning and decision-making.



<div class="usecase-section-header"><b>Background</b></div>

Weather conditions have a significant impact on various sectors, including agriculture, construction, and transportation. Accurate weather forecasts and classifications can help in planning and executing operations more efficiently. For instance, farmers can optimize planting and harvesting times based on expected weather conditions, while construction projects can be scheduled to avoid adverse weather that could delay progress or compromise safety.

In this project, we will use historical weather data from Melbourne's open data portal. The datasets include:
- Microclimate sensors data — CoM Open Data Portal (melbourne.vic.gov.au)
- Argyle Square Weather Stations (Historical Data) — CoM Open Data Portal (melbourne.vic.gov.au)

- Argyle Square Air Quality — CoM Open Data Portal (melbourne.vic.gov.au)


<div class="usecase-section-header"><b>Dataset Information</b></div>

The dataset for this project includes the following features::
- Ambient air temperature (°C)
- Relative humidity (%)
- Atmospheric pressure (hPa)
- Wind speed (m/s)
- Wind direction (degrees)
- Gust wind speed (m/s)


These features will be used to classify weather conditions into categories such as sunny, cloudy, rainy, and stormy. The dataset will be preprocessed to handle any missing values, outliers, or inconsistencies before being used to train the machine learning model.

In [2]:
#dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
import requests
from io import StringIO
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import cross_val_score, StratifiedKFold, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.ensemble import AdaBoostClassifier

In [3]:
# **Preferred Method**: Export Endpoint


def API_unlimited(datasetName):

    dataset_id = datasetName
    # https://data.melbourne.vic.gov.au/explore/dataset/pedestrian-counting-system-monthly-counts-per-hour/information/
    #dataset_id = 'pedestrian-counting-system-monthly-counts-per-hour'
    
    base_url = 'https://data.melbourne.vic.gov.au/api/explore/v2.1/catalog/datasets/'
    #apikey = api_key
    dataset_id = dataset_id
    format = 'csv'
    
    url = f'{base_url}{dataset_id}/exports/{format}'
    params = {
        'select': '*',
        'limit': -1,  # all records
        'lang': 'en',
        'timezone': 'UTC',
       # 'api_key': apikey
    }
    
    # GET request
    response = requests.get(url, params=params)
    
    if response.status_code == 200:
        # StringIO to read the CSV data
        url_content = response.content.decode('utf-8')
        datasetName = pd.read_csv(StringIO(url_content), delimiter=';')
        print(datasetName.sample(10, random_state=999)) # Test
        return datasetName
    else:
        print(f'Request failed with status code {response.status_code}')

In [None]:
dataset_id_1 = 'microclimate-sensors-data'
dataset_id_2 = 'meshed-sensor-type-1'
dataset_id_3 = 'argyle-square-air-quality'

dataset1 = API_unlimited(dataset_id_1)

In [None]:
dataset2 = API_unlimited(dataset_id_2)

In [None]:
dataset3 = API_unlimited(dataset_id_3)

In [None]:
dataset1.head()

In [None]:
dataset1.shape

In [None]:
dataset1.columns

In [None]:
dataset2.head()

In [None]:
dataset2.shape

In [None]:
dataset2.columns

In [None]:
dataset3.head()

In [None]:
dataset3.shape

In [None]:
dataset3.columns

#### standardizing column names for dataset1

In [None]:
dataset1 = dataset1.rename(columns={'received_at':'time', 'latlong':'lat_long', 'minimumwinddirection':'min_wind_direction', 'averagewinddirection':'avg_wind_direction', 'maximumwinddirection':'max_wind_direction', 
                        'minimumwindspeed':'min_wind_speed', 'averagewindspeed':'avg_wind_speed', 'gustwindspeed':'gust_wind_speed',
          'airtemperature':'air_temp', 'relativehumidity':'humidity', 'atmosphericpressure':'atm_pressure'})
dataset1.head()

#### atandersizing column names for datasets2

In [None]:
dataset2 =dataset2.rename(columns={'windspeed':'avg_wind_speed', 'winddirection':'avg_wind_direction', 'gustspeed':'gust_wind_speed',
           'atmosphericpressure':'atm_pressure', 'relativehumidity':'humidity', 'airtemp':'air_temp'})
dataset2.head()

#### standersizing column names for datasets3

In [None]:
dataset3.rename(columns={'temperature':'air_temp',})
dataset3.head()

In [None]:
dataset3.shape

#### merging the datasets together

In [None]:
# Convert time column in df1 to datetime with UTC timezone
#dataset2['time'] = pd.to_datetime(dataset2['time'], utc=True)

In [None]:
combine_df = pd.concat([dataset2, dataset3], axis=1,join='inner')
combine_df.head()

In [None]:
missing_percent =combine_df.isna().sum()*100/len(combine_df)
missing_percent

In [None]:
combine_df.columns


In [None]:
plt.figure(figsize=(7,3))
new_par = missing_percent[missing_percent <= 10].plot.bar()
plt.gca().set_xlabel("columns")
plt.gca().set_ylabel("counts")
plt.gca().set_title("percentage of missing value")
plt.grid()
plt.show()

In [None]:
##column to work with
selected_columns =['time', 'lat_long','rtc', 'battery', 'solarpanel', 'command', 'solar',
       'precipitation', 'strikes', 'avg_wind_speed', 'avg_wind_direction',
       'gust_wind_speed', 'vapourpressure', 'atm_pressure', 'humidity',
       'air_temp', 'lat_long',   'averagespl', 'carbonmonoxide', 'humidity', 'ibatt',
       'nitrogendioxide', 'ozone', 'particulateserr', 'particulatesvsn',
       'peakspl', 'pm1', 'pm10', 'pm25', 'temperature', 'vbatt', 'vpanel']


new_df = combine_df[selected_columns]
new_df.head()

In [None]:
new_df.columns = ['time_1', 'time_2', 'lat_long_1', 'lat_long_2', 'rtc', 'battery', 'solarpanel',
       'command', 'solar', 'precipitation', 'strikes', 'avg_wind_speed',
       'avg_wind_direction', 'gust_wind_speed', 'vapourpressure',
       'atm_pressure', 'humidity', 'humidity', 'air_temp', 'lat_long',
       'lat_long', 'averagespl', 'carbonmonoxide', 'humidity_1', 'humidity_2',
       'ibatt', 'nitrogendioxide', 'ozone', 'particulateserr',
       'particulatesvsn', 'peakspl', 'pm1', 'pm10', 'pm25', 'temperature',
       'vbatt', 'vpanel']

In [None]:
cleaned_df = new_df.drop(columns=['time_2','lat_long', 'lat_long_2','humidity','humidity_2'])

In [None]:
cleaned_df.head()

#### checking the distribution of some features

In [None]:

#trying to check the distribution of 
plt.figure(figsize=(16,3))
plt.subplot(1,3,1)
cleaned_df['atm_pressure'].hist()
plt.title('atm_pressure')
plt.ylabel('counts')
plt.xlabel('pressure level')

plt.subplot(1,3,2)
cleaned_df['ozone'].hist()
plt.title('ozone')
plt.ylabel('counts')
plt.xlabel('ozone level')

plt.subplot(1,3,3)
cleaned_df['air_temp'].hist()
plt.title('air_temp')
plt.ylabel('counts')
plt.xlabel('temp level')

In [None]:
#####fillin up the nan with the their median
for column in cleaned_df:
    if pd.api.types.is_numeric_dtype(cleaned_df[column]):
        cleaned_df[column].fillna(cleaned_df[column].median(), inplace=True)

# Display the DataFrame after filling NA with median
cleaned_df.head()

In [None]:
cleaned_df.columns

In [None]:
cleaned_df.isna().sum()*100/len(cleaned_df)

In [None]:

cleaned_df['lat_long_1'] = cleaned_df['lat_long_1'].fillna(cleaned_df['lat_long_1'].mode())

In [None]:
mode_value = cleaned_df['lat_long_1'].mode()[0]

cleaned_df['lat_long_1'] = cleaned_df['lat_long_1'].fillna(mode_value)

In [None]:
cleaned_df.isna().sum()*100/len(cleaned_df)

In [None]:
cleaned_df.tail()

##### ANalysis

In [None]:
cleaned_df.describe()

#### Variation od Avg_Wind_speed and Gust_Wind_speed

In [None]:

# Convert 'time_1' to datetime
cleaned_df['time_1'] = pd.to_datetime(cleaned_df['time_1'])

# Extract the month
cleaned_df['month'] = cleaned_df['time_1'].dt.month

# Calculate monthly means
monthly_avg = cleaned_df.groupby('month')[['avg_wind_speed', 'gust_wind_speed']].mean()

# Ensure all months are represented, even if some months might not have data
all_months = pd.DataFrame({'month': np.arange(1, 13)})
monthly_avg = all_months.merge(monthly_avg, on='month', how='left').set_index('month')

# Set the background style to dark
sns.set_style('darkgrid')
plt.style.use('dark_background')

# Plot monthly means
plt.figure(figsize=(12, 5))
ax = monthly_avg.plot(kind='line', marker='o', linestyle='-', ax=plt.gca(), color=['#1f77b4', '#ff7f0e'])
plt.title('Seasonal Variation of Wind Speeds', fontsize=20, fontweight='bold', color='white')
plt.xlabel('Month', fontsize=15, color='white')
plt.ylabel('Wind Speed', fontsize=15, color='white')
plt.xticks(ticks=np.arange(12), labels=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'], color='white')
plt.yticks(color='white')
plt.legend(['Average Wind Speed', 'Gust Wind Speed'], fontsize=12)

# Customize the plot appearance
ax.spines['bottom'].set_color('white')
ax.spines['left'].set_color('white')
ax.tick_params(colors='white')
ax.grid(True, linestyle='--', alpha=0.6)

plt.tight_layout()
plt.show()


The graph shows the seasonal variation of average wind speed and gust wind speed throughout the year. In February and September, both average and gust wind speeds peak, indicating stronger winds during these months. Conversely, December and October experience significant drops in wind speeds, with December starting low and October having a notable decline after the September peak.
This  behavior during these periods may be due to changes seen in the winter and autumn months.

#### Impact of Vapour Pressure on Humidity

In [None]:

# Set up the plotting area
plt.figure(figsize=(12, 5))
plt.style.use('dark_background')

# Create scatter plot with a gradient based on density
sns.scatterplot(x='vapourpressure', y='humidity_1', data=cleaned_df, hue='vapourpressure', palette='viridis')

# Set plot title and labels
plt.title('Impact of Vapour Pressure on Humidity Levels', color='white')
plt.xlabel('Vapour Pressure', color='white')
plt.ylabel('Humidity', color='white')
plt.legend(title='Vapour Pressure', loc='upper right')

# Show the plot
plt.show()


The scatter plot shows how vapour pressure and humidity levels are related, using different colors to represent various vapour pressure values. Higher vapour pressure, shown in green and yellow, usually goes along with higher humidity levels. Most data points fall between 1.0 and 2.5 vapour pressure units, with humidity levels ranging from 0% to 100%.

When humidity levels are high (above 60%), vapour pressure values are spread out but tend to be higher. On the other hand, when humidity is low (below 40%), vapour pressure is generally lower, shown in purple and blue. This pattern suggests that higher humidity often means higher vapour pressure and more concentrated vapour pressure, though there is some variation.

####  Patterns in Air Temperature, Humidity, and Pressure

In [None]:
plt.figure(figsize=(12, 6))
cleaned_df[['air_temp', 'humidity_1', 'atm_pressure']].plot()
plt.title('Temporal Patterns in Air Temperature, Humidity, and Atmospheric Pressure')
plt.xlabel('Time')
plt.ylabel('Values')
plt.legend(['Air Temperature', 'Humidity', 'Atmospheric Pressure'])
plt.show()

The graph shows a moist environment as a result of the consistent high humidity levels.
The air temperature, fluctuates a lot, suggesting regular cycles like daily or seasonal changes.
The atmospheric pressure remains stable, which might indicate that significant weather changes might not be frequent.

#### patterns of pollutants across the year

In [None]:

plt.figure(figsize=(20, 9))
plt.style.use('dark_background')

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Plot for Carbon Monoxide
plt.subplot(2, 2, 1)
plt.scatter(cleaned_df['month'], cleaned_df['carbonmonoxide'], color='cyan', alpha=0.5)
plt.title('Carbon Monoxide Concentration by Month', color='white')
plt.xlabel('Month', color='white')
plt.ylabel('Carbon Monoxide (CO)', color='white')
plt.xticks(range(1, 13), months, rotation=45, color='white')
plt.yticks(color='white')

# Plot for Nitrogen Dioxide
plt.subplot(2, 2, 2)
plt.scatter(cleaned_df['month'], cleaned_df['nitrogendioxide'], color='magenta', alpha=0.5)
plt.title('Nitrogen Dioxide Concentration by Month', color='white')
plt.xlabel('Month', color='white')
plt.ylabel('Nitrogen Dioxide (NO2)', color='white')
plt.xticks(range(1, 13), months, rotation=45, color='white')
plt.yticks(color='white')

# Plot for PM10
plt.subplot(2, 2, 3)
plt.scatter(cleaned_df['month'], cleaned_df['pm10'], color='yellow', alpha=0.5)
plt.title('PM10 Concentration by Month', color='white')
plt.xlabel('Month', color='white')
plt.ylabel('PM10', color='white')
plt.xticks(range(1, 13), months, rotation=45, color='white')
plt.yticks(color='white')

# Plot for PM2.5
plt.subplot(2, 2, 4)
plt.scatter(cleaned_df['month'], cleaned_df['pm25'], color='lime', alpha=0.5)
plt.title('PM2.5 Concentration by Month', color='white')
plt.xlabel('Month', color='white')
plt.ylabel('PM2.5', color='white')
plt.xticks(range(1, 13), months, rotation=45, color='white')
plt.yticks(color='white')

# Adjust layout
plt.tight_layout()

# Show plot
plt.show()


Carbon Monoxide (CO) and Nitrogen Dioxide (NO2):

These pollutants have levels that go up and down every month, but there are no clear trends showing they are increasing or decreasing over time. This suggests that the sources of these pollutants, like cars and factories, are consistent throughout the year.

PM10 and PM2.5:
In terms pf PM10(particulate matter) levels also fluctuate each month without a clear trend, which might be due to weather changes or different activities.
PM2.5(particulate matter) levels are stable and low throughout the year, indicating that efforts to control this pollutant might be effective.

#### Efect of pollutants across different years

In [None]:
plt.figure(figsize=(15, 6))
plt.scatter(cleaned_df.index, cleaned_df['ozone'], label='Ozone', color='#e377c2', alpha=0.6, s=10)
plt.title('Ozone Levels Over Time', fontsize=20, fontweight='bold', color='white')
plt.xlabel('Time', fontsize=15, color='white')
plt.ylabel('Ozone Levels', fontsize=15, color='white')
plt.xticks(color='white')
plt.yticks(color='white')
plt.legend(fontsize=12)
plt.grid(True, linestyle='--', alpha=0.6)
plt.style.use('dark_background')
plt.show()

The graph shows that ozone levels have been fairly consistent from 2021 to 2024, with most measurements densely clustered between 100 and 200 units. This indicates that the pollutant levels have remained stable over these years. The consistency in ozone levels suggests that the sources of pollution have not changed significantly.

If this trend continues, it could have a lasting impact on air quality and health in the future. Monitoring and addressing these  pollutant levels are important to prevent negative effects.

### Feature engineering
##### - removing the outlier
##### - finding the correlation
##### -getting the truth label for the dataset
##### -normalizing the dataset

In [None]:

outliers_dict = {}

# Iterate through columns and process only numeric columns
for column in cleaned_df.columns:
    if pd.api.types.is_numeric_dtype(cleaned_df[column]):
        # Calculate quantiles for the numeric column
        Q1 = cleaned_df[column].quantile(0.25)
        Q3 = cleaned_df[column].quantile(0.75)
        IQR = Q3 - Q1
        
        # Identify outliers in the numeric column
        outliers = ((cleaned_df[column] < (Q1 - 1.5 * IQR)) | (cleaned_df[column] > (Q3 + 1.5 * IQR))).sum()
        
        outliers_dict[column] = outliers
print(outliers_dict)


In [None]:
from scipy.stats import mstats
# Define Winsorization limits
lower_limit = 0.005
upper_limit = 0.005

# Apply Winsorization to the DataFrame
cleaned_df_win = cleaned_df.copy()

for column in cleaned_df.columns:
    if column != 'time_1':  # Skip the datetime column
        cleaned_df_win[column] = mstats.winsorize(cleaned_df[column], limits=(lower_limit, upper_limit))

cleaned_df_win.head()


In [None]:
cleaned_df_win.describe()

In [None]:
cleaned_df_win.head()

#### getting the correlation of the dataframe

In [None]:

def corrheatmap(R, labels):
    """
    Draws a correlation heat map, given:
    * R - matrix of correlation coefficients for all variable pairs,
    * labels - list of column names
    """
    assert R.shape[0] == R.shape[1] and R.shape[0] == len(labels)
    k = R.shape[0]

    # plot the heat map using a custom colour palette
    # (correlations are in [-1, 1])
    plt.imshow(R, cmap=plt.get_cmap("RdBu"), vmin=-1, vmax=1)

    # add text labels
    for i in range(k):
        for j in range(k):
            plt.text(j, i, f"{R[i, j]:.2f}", ha="center", va="center",
                     color="black" if np.abs(R[i, j]) < 0.5 else "white")

    plt.xticks(np.arange(k), labels=labels, rotation=30, ha='right')
    plt.tick_params(axis="x", which="both",
                    labelbottom=True, labeltop=False, bottom=False, top=False)

    plt.yticks(np.arange(k), labels=labels)
    plt.tick_params(axis="y", which="both",
                    labelleft=True, left=False, right=False)

    plt.grid(False)
    plt.colorbar()

# Specify columns to exclude
columns_to_exclude = ['time_1', 'lat_long_1', 'month', 'command', 'battery', 'peakspl',
                     'ibatt', 'averagespl','rtc','solar','solarpanel']  # replace with your column names

# Drop the specified columns
cleaned_df_reduced = cleaned_df.drop(columns=columns_to_exclude)

# Calculate the correlation matrix
correlation_matrix = cleaned_df_reduced.corr()

# Get the column labels
labels = correlation_matrix.columns

# Plot the correlation heatmap
plt.figure(figsize=(15, 10))
corrheatmap(correlation_matrix.values, labels)
plt.title('Pearson Correlation Heatmap', fontsize=16, fontweight='bold')
plt.show()


#### important columns needed

In [None]:
cleaned_df_win.columns

In [None]:
final_df = cleaned_df_win[['time_1', 'lat_long_1', 'precipitation', 'strikes', 'avg_wind_speed',
       'avg_wind_direction', 'gust_wind_speed', 'vapourpressure',
       'atm_pressure', 'air_temp', 'carbonmonoxide',
       'humidity_1', 'nitrogendioxide', 'ozone', 'particulateserr',
       'particulatesvsn', 'pm1', 'pm10', 'pm25', 'temperature']]
final_df['ozone'] = final_df['ozone'].apply(lambda x: max(x, 0))

In [None]:
final_df.head()

In [None]:
final_df.describe()

#### creating a threshold for the Truth_label

In [None]:
def classify_weather(row):
    # Define thresholds based on typical values for sunny weather
    if (#row['atm_pressure'] > 101.2 and  # Above average atmospheric pressure
        row['humidity_1'] < 40 and  # Lower humidity
        row['temperature'] > 23):  # Higher temperature
        return 'sunny'
    
    # Define thresholds for rainy weather
    elif (#row['atm_pressure'] < 100.7 and  # Below average atmospheric pressure
          row['humidity_1'] >= 70 and  # Higher humidity
          row['temperature'] <= 21):  # Lower temperature
        return 'rainy'
    
    # Define thresholds for cloudy weather
    elif (#100.7 < row['atm_pressure'] < 101.2 and  # Intermediate atmospheric pressure
          30 < row['humidity_1'] < 75 and  # Intermediate humidity
          14 < row['temperature'] < 22):  # Intermediate temperature
        return 'cloudy'
    
    # Define thresholds for stormy weather
    elif (#row['atm_pressure'] < 100.0 and  # Very low atmospheric pressure
          row['humidity_1'] > 85 and  # Very high humidity
          row['temperature'] > 20):  # Variable temperature
        return 'stormy'
    
    # Default case if no specific condition is met
    return 'sunny'

final_df['status'] = final_df.apply(classify_weather, axis=1)
final_df.head()


In [None]:
final_df['status'].value_counts()

#### Getting The percentage for each target label and visualizing it

In [None]:
label_distribution = final_df['status'].value_counts()
total_samples = final_df['status'].count()
label_percentage = (label_distribution/total_samples)*100
print(label_percentage)

In [None]:

plt.figure(figsize=(8,3))
final_df['status'].hist()
plt.title(' The Weather Label',fontsize=16, fontweight='bold', color='white')
plt.xlabel('target label', fontsize=12, color='white')
plt.ylabel('counts', fontsize=12, color='white')

#### Normalizing the dataset without the target label

In [None]:

# Drop the non-numeric columns
norm_column = final_df.iloc[:, 2:-1]
scaler = MinMaxScaler()

norm_column_scaled = scaler.fit_transform(norm_column)

# Convert the normalized data back to a DataFrame with original column names
norm_column_scaled = pd.DataFrame(norm_column_scaled, columns=norm_column.columns)

norm_column_scaled.head()


###  features selection

In [None]:
first_column = final_df.iloc[:,0]
last_column = final_df.iloc[:,-1]
first_column.head()

In [None]:
#combining my features needed for my model
feat_df = pd.concat([first_column, norm_column_scaled, last_column], axis=1)
feat_df.head()

#### encoding my truth_label

In [None]:
from sklearn.preprocessing import LabelEncoder

# Select the last column
last_column = feat_df.columns[-1]

# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Fit and transform the last column, and add the encoded values as a new column
feat_df[last_column + '_encoded'] = label_encoder.fit_transform(feat_df[last_column])
feat_df.head()


#### using RandomForestRegressor to determine the features that contributed to high rate of Ozone

In [None]:
X = feat_df.drop(columns=['time_1','status', 'ozone','status_encoded'])
Y = feat_df['ozone']

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size= 0.3, random_state = 42)

In [None]:
print(X_train.shape)
print(Y_train.shape)
print(X_test.shape)

In [None]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators = 100, random_state= 42)

net = model.fit(X_train, Y_train)


In [None]:
#get features importance
importance = model.feature_importances_
feature_names = X.columns
#Create a DataFrame for visualization
importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': importance})
importance_df = importance_df.sort_values(by='Importance', ascending=False)

# Plot feature importances
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 8))
plt.barh(importance_df['Feature'], importance_df['Importance'], color='skyblue')
plt.xlabel('Importance', fontsize=15)
plt.ylabel('Feature', fontsize=15)
plt.title('Feature Importances for Ozone Levels', fontsize=20, fontweight='bold')
plt.gca().invert_yaxis()
plt.style.use('dark_background')
plt.show()

In [None]:
x = feat_df.drop(columns=['time_1', 'status','status_encoded'])
y = feat_df['status_encoded']

In [None]:
from sklearn.manifold import TSNE

# Apply t-SNE
#tsne = TSNE(n_components=2, random_state=42)
#X_tsne = tsne.fit_transform(X_scaled)

tsne = TSNE(n_components =2, random_state= 42)
X_tsne = tsne.fit_transform(x)

# Plotting the t-SNE results
plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis')

# Adding a legend with labels
unique_labels = list(feat_df['status_encoded'].unique())
plt.legend(handles=scatter.legend_elements()[0], labels=unique_labels)
plt.title('t-SNE Visualization of Iris Dataset')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.show()


.In this plot, we can see that the data points, represented by different colors for each class, are spread out and overlap in some areas rather than forming clear, separate clusters.This overlapping indicates that the data might not be easily separable using a simple linear boundary, suggesting that a non-linear model might be more effective for classification tasks on this dataset.



#### Balancing the label using Resampling Technique by using combination of sampling called smote

In [None]:
x = feat_df.drop(columns=['time_1', 'status','status_encoded'])
y = feat_df['status_encoded']

In [None]:
#applying smote
smote = SMOTE(random_state = 42)

x_smote, y_smote = smote.fit_resample(x, y)


In [None]:
plt.figure(figsize=(8,3))
y_smote.hist()
plt.title(' The Weather Label',fontsize=16, fontweight='bold', color='white')
plt.xlabel('target label', fontsize=12, color='white')
plt.ylabel('counts', fontsize=12, color='white')

#### dataset splitting for model training

In [None]:
#split data for training

x_train, x_test, y_train, y_test = train_test_split(x_smote, y_smote, test_size=0.2, random_state=42)


In [None]:
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

#### using a Logistic Regression

In [None]:
from sklearn.model_selection import cross_val_score, StratifiedKFold

#intialize logistic regresiom
log_reg = LogisticRegression(penalty = 'l1', solver='liblinear', C=0.2)

#cross_validation
#cv = StratifiedKFold(n_splits =5, shuffle = True, random_state = 42)

#perform cross val
scores = cross_val_score(log_reg, x_train, y_train, cv=5, scoring='accuracy')

#scores = cross_val_score(log_reg, x, y, cv=cv, scoring='accuracy')
print(f'Cross-validated accuracy scores: {scores}')
print(f'Average accuracy: {scores.mean():.2f}')

log_reg.fit(x_train, y_train)

#evaluation
training_accuracy = log_reg.score(x_train,y_train)
testing_accuracy = log_reg.score(x_test, y_test)

print(f"training_accuracy : {training_accuracy}")
print(f"testing_accuracy : {testing_accuracy}")



In [None]:

# Cross-validated accuracy scores and mean accuracy for Logistic Regression
scores_log_reg = [0.75065131, 0.74761187, 0.75005713, 0.74957722, 0.74898304]
mean_accuracy_log_reg = np.mean(scores_log_reg)

# Training and testing accuracy for Logistic Regression
training_accuracy_log_reg = 0.7493029845971022
testing_accuracy_log_reg = 0.7476690189769278

# Plotting the accuracies for Logistic Regression without grid lines
plt.figure(figsize=(8, 4))
plt.plot(range(1, 6), scores_log_reg, marker='o', linestyle='-', color='skyblue', label='Cross-validated Accuracy')
plt.axhline(y=training_accuracy_log_reg, color='g', linestyle='-', label=f'Training Accuracy: {training_accuracy_log_reg:.2f}')
plt.axhline(y=testing_accuracy_log_reg, color='b', linestyle='-', label=f'Testing Accuracy: {testing_accuracy_log_reg:.2f}')
plt.title('Logistic Regression Classifier Accuracy')
plt.xlabel('Fold')
plt.ylabel('Accuracy')
plt.legend(loc='best')
plt.grid(False)  # Disable grid lines
plt.show()


#### Using support Vector Machine

In [None]:
x_train_svm, x_test_svm, y_train_svm, y_test_svm= train_test_split(x_smote, y_smote, test_size=0.2, random_state=42)

svm_clf = SVC(kernel = 'rbf', C= 0.2, gamma='scale', random_state = 42) #using rbf because is non-linear

scores_svm = cross_val_score(svm_clf, x_train_svm, y_train_svm, cv=5, scoring= 'accuracy')
print(f'Cross-validated accuracy scores: {scores_svm}')
print(f'Average accuracy: {scores_svm.mean():.2f}')

#fitting svm
svm_clf.fit(x_train_svm, y_train_svm)

training_accuracy_svm = svm_clf.score(x_train_svm,y_train_svm)
testing_accuracy_svm = svm_clf.score(x_test_svm, y_test_svm)

print(f"training_accuracy : {training_accuracy_svm}")
print(f"testing_accuracy : {testing_accuracy_svm}")

In [None]:
# Cross-validated accuracy scores and mean accuracy for SVM
scores_svm = [0.94325609, 0.94321038, 0.94309612, 0.94122218, 0.94430733]
mean_accuracy_svm = np.mean(scores_svm)

# Training and testing accuracy for SVM
training_accuracy_svm = 0.9478221125279949
testing_accuracy_svm = 0.9467987860616476

# Plotting the accuracies for SVM without grid lines
plt.figure(figsize=(8, 4))
plt.plot(range(1, 6), scores_svm, marker='o', linestyle='-', color='skyblue', label='Cross-validated Accuracy')
plt.axhline(y=training_accuracy_svm, color='g', linestyle='-', label=f'Training Accuracy: {training_accuracy_svm:.2f}')
plt.axhline(y=testing_accuracy_svm, color='b', linestyle='-', label=f'Testing Accuracy: {testing_accuracy_svm:.2f}')
plt.title('SVM Classifier Accuracy')
plt.xlabel('Fold')
plt.ylabel('Accuracy')
plt.legend(loc='best')
plt.grid(False)  # Disable grid lines
plt.show()


#### using Decission Tree

In [None]:
x_train_dt, x_test_dt, y_train_dt, y_test_dt= train_test_split(x_smote, y_smote, test_size=0.42,random_state=42)

dt_clf = DecisionTreeClassifier(max_depth = 2, class_weight='balanced', criterion='gini',random_state=42)

score_dt = cross_val_score(dt_clf, x_train_dt, y_train_dt, cv=5,scoring='accuracy')

print(f'cross-validated accuracy scores: {score_dt}')
print(f'Average accuracy: {score_dt.mean():.2f}')

#fiiting decission tree
dt_clf.fit(x_train_dt, y_train_dt)

training_accuracy_dt = dt_clf.score(x_train_dt, y_train_dt)
testing_accuracy_dt = dt_clf.score(x_test_dt, y_test_dt)

print(f"training_accuracy : {training_accuracy_dt}")
print(f"testing_accuracy : {testing_accuracy_dt}")

In [None]:
# Cross-validated accuracy scores and mean accuracy for Decision Tree
score_dt = [0.87703704, 0.87836091, 0.87675335, 0.87823099, 0.87823099]
mean_accuracy_dt = np.mean(score_dt)

# Training and testing accuracy for Decision Tree
training_accuracy_dt = 0.8777289548173972
testing_accuracy_dt = 0.8778740260305576

# Plotting the accuracies for Decision Tree without grid lines
plt.figure(figsize=(8, 4))
plt.plot(range(1, 6), score_dt, marker='o', linestyle='-', color='skyblue', label='Cross-validated Accuracy')
plt.axhline(y=training_accuracy_dt, color='g', linestyle='-', label=f'Training Accuracy: {training_accuracy_dt:.2f}')
plt.axhline(y=testing_accuracy_dt, color='b', linestyle='-', label=f'Testing Accuracy: {testing_accuracy_dt:.2f}')
plt.title('Decision Tree Classifier Accuracy')
plt.xlabel('Fold')
plt.ylabel('Accuracy')
plt.legend(loc='best')
plt.grid(False)
plt.show()


In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Cross-validated accuracy scores and mean accuracy for Decision Tree
score_dt = [0.87703704, 0.87836091, 0.87675335, 0.87823099, 0.87823099]

# Training and testing accuracy for Decision Tree
training_accuracy_dt = 0.8777289548173972
testing_accuracy_dt = 0.8778740260305576

# Plotting the accuracies without grid lines and without the average line
plt.figure(figsize=(8, 4))
plt.plot(range(1, 6), score_dt, marker='o', linestyle='-', color='skyblue', label='Cross-validated Accuracy')
plt.axhline(y=training_accuracy_dt, color='g', linestyle='-', label=f'Training Accuracy: {training_accuracy_dt:.2f}')
plt.axhline(y=testing_accuracy_dt, color='b', linestyle='-', label=f'Testing Accuracy: {testing_accuracy_dt:.2f}')
plt.title('Decision Tree Classifier Accuracy')
plt.xlabel('Fold')
plt.ylabel('Accuracy')
plt.legend(loc='best')
plt.grid(False)
plt.show()


#### Using AdaBoostClassifier

In [None]:
x_train_ad, x_test_ad, y_train_ad, y_test_ad= train_test_split(x_smote, y_smote, test_size=0.2, random_state=42)

ad_clf = AdaBoostClassifier(n_estimators=100, random_state=42)

score_ad = cross_val_score(ad_clf, x_train_ad, y_train_ad, cv=5, scoring='accuracy')


print(f'cross-validated accuracy scores: {score_ad}')
print(f'Average accuracy: {score_ad.mean():.2f}')

ad_clf.fit(x_train, y_train)

training_accuracy_ad = ad_clf.score(x_train_ad, y_train_ad)
testing_accuracy_ad = ad_clf.score(x_test_ad, y_test_ad)

print(f"training_accuracy : {training_accuracy_ad}")
print(f"testing_accuracy : {testing_accuracy_ad}")

In [None]:

# Cross-validated accuracy scores and mean accuracy for AdaBoost
score_ad = [0.83932081, 0.84425705, 0.84446273, 0.84473696, 0.84325152]
mean_accuracy_ad = np.mean(score_ad)

# Training and testing accuracy for AdaBoost
training_accuracy_ad = 0.843973673385438
testing_accuracy_ad = 0.8436505905151925

# Plotting the accuracies for AdaBoost without grid lines
plt.figure(figsize=(8, 4))
plt.plot(range(1, 6), score_ad, marker='o', linestyle='-', color='skyblue', label='Cross-validated Accuracy')
plt.axhline(y=training_accuracy_ad, color='g', linestyle='-', label=f'Training Accuracy: {training_accuracy_ad:.2f}')
plt.axhline(y=testing_accuracy_ad, color='b', linestyle='-', label=f'Testing Accuracy: {testing_accuracy_ad:.2f}')
plt.title('AdaBoost Classifier Accuracy')
plt.xlabel('Fold')
plt.ylabel('Accuracy')
plt.legend(loc='best')
plt.grid(False)
plt.show()


#### Using Random Forest Classifier

In [None]:
x_train_rf, x_test_rf, y_train_rf, y_test_rf = train_test_split(x_smote, y_smote, test_size=0.2,random_state=42)

rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)

score_rf = cross_val_score(rf_clf,x_train_rf, y_train_rf, cv=5, scoring='accuracy')

print(f'cross-validated accuracy score : {score_rf}')
print(f'average accuracy : {score_rf.mean():.2f}')

rf_clf.fit(x_train_rf,y_train_rf)

training_accuracy_rf = rf_clf.score(x_train_rf, y_train_rf)
testing_accuracy_rf = rf_clf.score(x_test_rf, y_test_rf)

print(f"training_accuracy : {training_accuracy_rf}")
print(f"testing_accuracy : {testing_accuracy_rf}")

In [None]:

# Cross-validated accuracy scores and mean accuracy for RandomForest
score_rf = [0.99933726, 0.99956579, 0.99949723, 0.99926871, 0.99954294]
mean_accuracy_rf = np.mean(score_rf)

# Training and testing accuracy for RandomForest
training_accuracy_rf = 1.0
testing_accuracy_rf = 0.9997997649414531 

# Plotting the accuracies for RandomForest
plt.figure(figsize=(8, 4))
plt.plot(range(1, 6), score_rf, marker='o', linestyle='-', label='Cross-validated Accuracy')
#plt.axhline(y=mean_accuracy_rf, color='r', linestyle='--', label=f'Average CV Accuracy: {mean_accuracy_rf:.2f}')
plt.axhline(y=training_accuracy_rf, color='g', linestyle='-', label=f'Training Accuracy: {training_accuracy_rf:.2f}')
plt.axhline(y=testing_accuracy_rf, color='b', linestyle='-', label=f'Testing Accuracy: {testing_accuracy_rf:.2f}')
plt.title('Random Forest Classifier Accuracy')
plt.xlabel('Fold')
plt.ylabel('Accuracy')
plt.legend(loc='best')
plt.grid(False)
plt.show()


##### Comparing the performance of all models, the Random Forest classifier stands out as the best model for this classification task. It achieved the  highest cross-validated accuracy scores, averaging close to 1.00, and maintained perfect training accuracy (1.0) along with a nearly perfect testing accuracy (0.9998). These results suggest that the Random Forest model is extremely robust and effective at capturing the patterns in the data, significantly outperforming other models like SVM, Decision Tree, AdaBoost, and Logistic Regression in terms of both accuracy and consistency.
##### However, the perfect or near-perfect scores also suggest that it might be overfitting slightly to the training data, but its high testing accuracy indicates it generalizes well to unseen data. This makes Random Forest the optimal choice among the evaluated models for its superior performance.

### predicting with random Forest Classifier

***

_**DELETE BEFORE PUBLISHING**_

## Style guide for use cases

### Headers

For styling within your markdown cells, there are two choices you can use for headers.

1) You can use HTML classes specific to the use case styling:

```<p class="usecase-subsection-header">This is a subsection header.</p>```

<p style="font-weight: bold; font-size: 1.2em;">This is a subsection header.</p>

```<p class="usecase-subsection-blurb">This is a blurb header.</p>```

<p style="font-weight: bold; font-size: 1em; font-style:italic;">This is a blurb header.</p>


2) Or if you like you can use the markdown header styles:

```# for h1```

```## for h2```

```### for h3```

```#### for h4```

```##### for h5```

## Plot colour schemes

General advice:
1. Use the same colour or colour palette throughout your notebook, unless variety is necessary
2. Select a palette based on the type of data being represented
3. Consider accessibility (colourblindness, low vision)

#### 1) If all of your plots only use 1-2 colors use one of the company style colors:

| Light theme | Dark Theme |
|-----|-----|
|<p style="color:#2af598;">#2af598</p>|<p style="color:#08af64;">#08af64</p>|
|<p style="color:#22e4ac;">#22e4ac</p>|<p style="color:#14a38e;">#14a38e</p>|
|<p style="color:#1bd7bb;">#1bd7bb</p>|<p style="color:#0f9295;">#0f9295</p>|
|<p style="color:#14c9cb;">#14c9cb</p>|<p style="color:#056b8a;">#056b8a</p>|
|<p style="color:#0fbed8;">#0fbed8</p>|<p style="color:#121212;">#121212</p>|
|<p style="color:#08b3e5;">#08b3e5</p>||


#### 2) If your plot needs multiple colors, choose an appropriate palette using either of the following tutorials:
- https://seaborn.pydata.org/tutorial/color_palettes.html
- https://matplotlib.org/stable/tutorials/colors/colormaps.html

#### 3) Consider accessibility as well.

For qualitative plotting Seaborn's 'colorblind' palette is recommended. For maps with sequential or diverging it is recommended to use one of the Color Brewer schemes which can be previewed at https://colorbrewer2.org/.

If you want to design your own colour scheme, it should use the same principles as Cynthia Brewer's research (with variation not only in hue but also, saturation or luminance).

### References

Be sure to acknowledge your sources and any attributions using links or a reference list.

If you have quite a few references, you might wish to have a dedicated section for references at the end of your document, linked using footnote style numbers.

You can connect your in-text reference by adding the number with a HTML link: ```<a href="#fn-1">[1]</a>```

and add a matching ID in the reference list using the ```<fn>``` tag: ```<fn id="fn-1">[1] Author (Year) _Title_, Publisher, Publication location.</fn>```