<a href="https://colab.research.google.com/github/Anand14-web/AIML-2025/blob/main/2303A51310_05.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

27. Prediction of Air Quality in Italian Cities
    1. Identify the top 5 reasons for air quality
    2. Identify the day of the week with the most air quality issues
    3. Find the max and min air quality levels
    4. Identify the highest and lowest temperatures of air quality
    5. Identify the highest educational qualification of the employees.
    6. Apply either a Classification Model or a Clustering Model to evaluate the dataset

1.)
To predict air quality in Italian cities, you can analyze air quality data and identify factors influencing air quality. The top 5 reasons for air quality degradation typically include:

    1. Vehicular Emissions: High vehicle density leads to significant pollutant emissions (e.g., NOx, CO, and particulate matter).
    2. Industrial Emissions: Factories and industries emit pollutants, especially in areas with manufacturing plants.
    3. Domestic Heating: Wood and coal-based heating systems contribute to particulate matter.
    4. Geography and Weather: Geographical factors (like valleys) and weather conditions (low wind speeds, temperature inversions) trap pollutants.
    5. Agricultural Practices: Use of fertilizers and burning of crop residues can release harmful chemicals into the air.
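
The causes listed above are domain knowledge; the dataset has no column that names a "reason". One hedged way to connect them to the data is to rank the measured columns by how strongly they move with a reference pollutant. The sketch below assumes the UCI Air Quality column names (such as CO(GT), NOx(GT), and T) and that -200 marks a missing reading. The helper name top_correlated and the toy values are hypothetical, and a high correlation indicates association, not cause.

```python
import numpy as np
import pandas as pd


def top_correlated(df, target, n=5, missing=-200):
    """Rank numeric columns by absolute Pearson correlation with `target`."""
    # Assumption: `missing` is the sentinel used for absent readings.
    numeric = df.select_dtypes("number").replace(missing, np.nan)
    corr = numeric.corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False).head(n)


# Tiny synthetic frame for illustration only (not real measurements).
toy = pd.DataFrame({
    "CO(GT)":  [1.0, 2.0, 3.0, 4.0, -200, 6.0],
    "NOx(GT)": [50, 100, 150, 200, 250, 300],
    "T":       [20, 5, 18, 7, 16, 9],
})
print(top_correlated(toy, "CO(GT)", n=2))
```

On the real data, the same call could be made on air_quality.data.features with n=5.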

In [8]:
!pip install ucimlrepo
from ucimlrepo import fetch_ucirepo

# fetch dataset
air_quality = fetch_ucirepo(id=360)

# data (as pandas dataframes)
X = air_quality.data.features
y = air_quality.data.targets

# metadata
print(air_quality.metadata)

# variable information
print(air_quality.variables)


Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7
{'uci_id': 360, 'name': 'Air Quality', 'repository_url': 'https://archive.ics.uci.edu/dataset/360/air+quality', 'data_url': 'https://archive.ics.uci.edu/static/public/360/data.csv', 'abstract': 'Contains the responses of a gas multisensor device deployed on the field in an Italian city. Hourly responses averages are recorded along with gas concentrations references from a certified analyzer. ', 'area': 'Computer Science', 'tasks': ['Regression'], 'characteristics': ['Multivariate', 'Time-Series'], 'num_instances': 9358, 'num_features': 15, 'feature_types': ['Real'], 'demographics': [], 'target_col': None, 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 2008, 'last_updated': 'Sun Mar 10 2024', 'dataset_doi': '10.2

In [9]:
from ucimlrepo.fetch import *

In [10]:
class dotdict(dict):
    """dot.notation access to dictionary attributes"""
    __getattr__ = dict.get
    __setattr__ = dict.__setitem__
    __delattr__ = dict.__delitem__

In [11]:
import json
import pandas as pd
from typing import Optional
import urllib.request
import urllib.parse
import urllib.error     # used explicitly in the except clauses below
import certifi
import ssl

from ucimlrepo.dotdict import dotdict


# constants

# API endpoints
API_BASE_URL = 'https://archive.ics.uci.edu/api/dataset'
API_LIST_URL = 'https://archive.ics.uci.edu/api/datasets/list'

# base location of data csv files
DATASET_FILE_BASE_URL = 'https://archive.ics.uci.edu/static/public'

# available categories of datasets to filter by
VALID_FILTERS = ['aim-ahead']


# custom exception for no dataset found during fetch_ucirepo
class DatasetNotFoundError(Exception):
    pass


def fetch_ucirepo(
        name: Optional[str] = None,
        id: Optional[int] = None
    ):
    '''
    Loads a dataset from the UCI ML Repository, including the dataframes and metadata information.

    Parameters:
        id (int): Dataset ID for UCI ML Repository
        name (str): Dataset name, or substring of name
        (Only provide id or name, not both)

    Returns:
        result (dotdict): object containing dataset metadata, dataframes, and variable info in its properties
    '''

    # check that only one argument is provided
    if name and id:
        raise ValueError('Only specify either dataset name or ID, not both')

    # validate types of arguments and add them to the endpoint query string
    api_url = API_BASE_URL
    if name:
        if type(name) != str:
            raise ValueError('Name must be a string')
        api_url += '?name=' + urllib.parse.quote(name)
    elif id:
        if type(id) != int:
            raise ValueError('ID must be an integer')
        api_url += '?id=' + str(id)
    else:
        # no arguments provided
        raise ValueError('Must provide a dataset name or ID')


    # fetch metadata from API
    data = None
    try:
        response = urllib.request.urlopen(api_url, context=ssl.create_default_context(cafile=certifi.where()))
        data = json.load(response)
    except (urllib.error.URLError, urllib.error.HTTPError):
        raise ConnectionError('Error connecting to server')

    # verify that dataset exists
    if data['status'] != 200:
        error_msg = data['message'] if 'message' in data else 'Dataset not found in repository'
        raise DatasetNotFoundError(error_msg)


    # extract ID, name, and URL from metadata
    metadata = data['data']
    if not id:
        id = metadata['uci_id']
    elif not name:
        name = metadata['name']

    data_url = metadata['data_url']

    # no data URL means that the dataset cannot be imported into Python
    # i.e. it does not yet have a standardized CSV file for pandas to parse
    if not data_url:
        raise DatasetNotFoundError('"{}" dataset (id={}) exists in the repository, but is not available for import. Please select a dataset from this list: https://archive.ics.uci.edu/datasets?skip=0&take=10&sort=desc&orderBy=NumHits&search=&Python=true'.format(name, id))


    # parse into dataframe using pandas
    df = None
    try:
        df = pd.read_csv(data_url)
    except (urllib.error.URLError, urllib.error.HTTPError):
        raise DatasetNotFoundError('Error reading data csv file for "{}" dataset (id={}).'.format(name, id))

    if df.empty:
        raise DatasetNotFoundError('Error reading data csv file for "{}" dataset (id={}).'.format(name, id))


    # header line should be variable names
    headers = df.columns

    # feature information, class labels
    variables = metadata['variables']
    del metadata['variables']      # moved from metadata to a separate property

    # organize variables into IDs, features, or targets
    variables_by_role = {
        'ID': [],
        'Feature': [],
        'Target': [],
        'Other': []
    }
    for variable in variables:
        if variable['role'] not in variables_by_role:
            raise ValueError('Role must be one of "ID", "Feature", "Target", or "Other"')
        variables_by_role[variable['role']].append(variable['name'])

    # extract dataframes for each variable role
    ids_df = df[variables_by_role['ID']] if len(variables_by_role['ID']) > 0 else None
    features_df = df[variables_by_role['Feature']] if len(variables_by_role['Feature']) > 0 else None
    targets_df = df[variables_by_role['Target']] if len(variables_by_role['Target']) > 0 else None

    # place all varieties of dataframes in data object
    data = {
        'ids': ids_df,
        'features': features_df,
        'targets': targets_df,
        'original': df,
        'headers': headers,
    }

    # convert variables from JSON structure to tabular structure for easier visualization
    variables = pd.DataFrame.from_records(variables)

    # alternative usage?:
    # variables.age.role or variables.slope.description
    # print(variables) -> json-like dict with keys [name] -> details

    # make nested metadata fields accessible via dot notation
    metadata['additional_info'] = dotdict(metadata['additional_info']) if metadata['additional_info'] else None
    metadata['intro_paper'] = dotdict(metadata['intro_paper']) if metadata['intro_paper'] else None

    # construct result object
    result = {
        'data': dotdict(data),
        'metadata': dotdict(metadata),
        'variables': variables
    }

    # convert to dictionary with dot notation
    return dotdict(result)



def list_available_datasets(filter: Optional[str] = None, search: Optional[str] = None, area: Optional[str] = None):
    '''
    Prints a list of datasets that can be imported via fetch_ucirepo function

    Parameters:
        filter (str): Optional query to filter available datasets based on a label
        search (str): Optional query to search for available datasets by name
        area (str): Optional query to filter available datasets based on subject area

    Returns:
        None
    '''

    # validate filter input
    if filter:
        if type(filter) != str:
            raise ValueError('Filter must be a string')
        filter = filter.lower()

    # validate search input
    if search:
        if type(search) != str:
            raise ValueError('Search query must be a string')
        search = search.lower()

    # construct endpoint URL
    api_list_url = API_LIST_URL
    query_params = {}
    if filter:
        query_params['filter'] = filter
    else:
        query_params['filter'] = 'python'       # default filter should be 'python'
    if search:
        query_params['search'] = search
    if area:
        query_params['area'] = area

    api_list_url += '?' + urllib.parse.urlencode(query_params)

    # fetch list of datasets from API
    data = None
    try:
        response  = urllib.request.urlopen(api_list_url, context=ssl.create_default_context(cafile=certifi.where()))
        resp_json = json.load(response)
    except (urllib.error.URLError, urllib.error.HTTPError):
        raise ConnectionError('Error connecting to server')

    if resp_json['status'] != 200:
        # fall back to a generic message when the API response carries none
        error_msg = resp_json['message'] if 'message' in resp_json else 'Internal Server Error'
        raise ValueError(error_msg)

    data = resp_json['data']

    if len(data) == 0:
        print('No datasets found')
        return

    # column width for dataset name
    maxNameLen = max(max([len(dataset['name']) for dataset in data]) + 3, 15)

    # print table title
    title = 'The following {}datasets are available{}:'.format(filter + ' ' if filter else '', ' for search query "{}"'.format(search) if search else '')
    print('-' * len(title))
    print(title)
    if area:
        print('For subject area: {}'.format(area))
    print('-' * len(title))

    # print table headers
    header_str = '{:<{width}} {:<6}'.format('Dataset Name', 'ID', width=maxNameLen)
    underline_str = '{:<{width}} {:<6}'.format('------------', '--', width=maxNameLen)
    if len(data) > 0 and 'description' in data[0]:
        header_str += ' {:<100}'.format('Prediction Task')
        underline_str += ' {:<100}'.format('---------------')
    print(header_str)
    print(underline_str)

    # print row for each dataset
    for dataset in data:
        row_str = '{:<{width}} {:<6}'.format(dataset['name'], dataset['id'], width=maxNameLen)
        if 'description' in dataset:
            row_str += ' {:<100}'.format(dataset['description'])
        print(row_str)

    print()

2.)

    1. Load Data: Load the air quality data from a CSV file. Ensure it contains a timestamp column and an AQI column (or a similar pollutant measure).
    2. Extract Day of the Week: Convert the timestamp column to a datetime format and extract the day of the week.
    3. Aggregate Data: Group the AQI values by day of the week and compute the average for each day.
    4. Identify the Worst Day: Find the day with the highest average AQI.
    5. Visualization: Plot a bar chart of the average AQI for each day to highlight patterns.
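
The steps above can be sketched as follows. This is a minimal illustration, not the notebook's original code: the helper worst_weekday, the column name timestamp, and the toy values are assumptions, and -200 is treated as the missing-value sentinel. The UCI file appears to store the date and time in separate columns, so a combined timestamp column would need to be built first.

```python
import numpy as np
import pandas as pd


def worst_weekday(df, value_col, time_col="timestamp", missing=-200):
    """Return the weekday with the highest mean of `value_col` and the per-day means."""
    ts = pd.to_datetime(df[time_col])
    values = df[value_col].replace(missing, np.nan)  # sentinel -> NaN
    # Group by weekday name; days with no valid reading are dropped.
    by_day = values.groupby(ts.dt.day_name()).mean().dropna()
    return by_day.idxmax(), by_day


# Synthetic readings for illustration only.
toy = pd.DataFrame({
    "timestamp": ["2004-03-10 18:00", "2004-03-11 18:00",
                  "2004-03-17 18:00", "2004-03-12 18:00"],
    "CO(GT)": [2.0, 3.5, 4.0, -200],
})
day, by_day = worst_weekday(toy, "CO(GT)")
print(day)
print(by_day)
```

For the visualization step, by_day.plot.bar() draws the bar chart when matplotlib is installed.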

In [20]:
pip install ucimlrepo



In [21]:
from ucimlrepo import fetch_ucirepo

# fetch dataset
air_quality = fetch_ucirepo(id=360)

# data (as pandas dataframes)
X = air_quality.data.features
y = air_quality.data.targets

# metadata
print(air_quality.metadata)

# variable information
print(air_quality.variables)


{'uci_id': 360, 'name': 'Air Quality', 'repository_url': 'https://archive.ics.uci.edu/dataset/360/air+quality', 'data_url': 'https://archive.ics.uci.edu/static/public/360/data.csv', 'abstract': 'Contains the responses of a gas multisensor device deployed on the field in an Italian city. Hourly responses averages are recorded along with gas concentrations references from a certified analyzer. ', 'area': 'Computer Science', 'tasks': ['Regression'], 'characteristics': ['Multivariate', 'Time-Series'], 'num_instances': 9358, 'num_features': 15, 'feature_types': ['Real'], 'demographics': [], 'target_col': None, 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 2008, 'last_updated': 'Sun Mar 10 2024', 'dataset_doi': '10.24432/C59K5F', 'creators': ['Saverio Vito'], 'intro_paper': {'ID': 420, 'type': 'NATIVE', 'title': 'On field calibration of an electronic nose for benzene estimation in an urban pollution monitoring scenario', 'authors': 

3.)

    1. Load Data: The script assumes your dataset is a CSV file that includes an AQI column with numeric values representing air quality levels.
    2. Identify Max and Min Values: Use max() and idxmax() to get the maximum AQI value and the corresponding row; use min() and idxmin() for the minimum AQI value.
    3. Output: The maximum and minimum AQI values are displayed along with their details (e.g., timestamp, location).
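
A minimal sketch of this lookup, assuming a numeric column named AQI and a timestamp column (both hypothetical here; the UCI data exposes pollutant columns such as CO(GT) instead). The helper name extremes and the toy values are illustrative only.

```python
import pandas as pd


def extremes(df, column):
    """Return the rows holding the largest and smallest value of `column`."""
    valid = df[column].dropna()
    # idxmax/idxmin give index labels, which .loc turns back into full rows.
    return df.loc[valid.idxmax()], df.loc[valid.idxmin()]


# Synthetic readings for illustration only.
toy = pd.DataFrame({
    "timestamp": pd.to_datetime(["2004-03-10 18:00",
                                 "2004-03-10 19:00",
                                 "2004-03-10 20:00"]),
    "AQI": [42.0, 97.0, 15.0],
})
highest, lowest = extremes(toy, "AQI")
print("max:", highest["AQI"], "at", highest["timestamp"])
print("min:", lowest["AQI"], "at", lowest["timestamp"])
```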

In [22]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


4.)

    1. Dataset Requirements: A column with temperature values (temperature) and, optionally, a column with timestamps (timestamp) for context.
    2. Finding Maximum and Minimum Temperatures: max() and idxmax() are used to find the highest temperature and its corresponding row in the dataset; min() and idxmin() are used similarly for the lowest temperature.
    3. Output: The script displays the highest and lowest temperatures along with additional details like timestamp or location (if included in the dataset).
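
One caveat worth illustrating: if missing readings are stored as a sentinel such as -200 (assumed here for the UCI Air Quality data), a plain min() reports the sentinel rather than a real temperature. The sketch below masks the sentinel first. The column name T follows the UCI naming, while temperature_range, timestamp, and the toy values are hypothetical.

```python
import pandas as pd

MISSING = -200  # assumed sentinel for missing readings


def temperature_range(df, temp_col="T", time_col="timestamp"):
    """Return (value, timestamp) pairs for the highest and lowest valid temperature."""
    temps = df[temp_col].where(df[temp_col] != MISSING)  # sentinel -> NaN
    hi, lo = temps.idxmax(), temps.idxmin()              # NaN is skipped
    return {
        "highest": (temps[hi], df.loc[hi, time_col]),
        "lowest": (temps[lo], df.loc[lo, time_col]),
    }


# Synthetic readings for illustration only.
toy = pd.DataFrame({
    "timestamp": ["2004-03-10 18:00", "2004-03-10 19:00",
                  "2004-07-21 14:00", "2005-01-15 05:00"],
    "T": [13.6, -200, 30.2, 4.1],
})
result = temperature_range(toy)
print(result["highest"])
print(result["lowest"])
```

Without the mask, toy["T"].min() would return -200.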

5.)

    1. Load the Dataset: Load a CSV file containing employee details, including an education column (qualification).
    2. Check Unique Qualifications: List the distinct values to verify consistent formatting, since inconsistent entries (e.g., "Bachelors" vs. "Bachelor's") might require cleaning.
    3. Define Qualification Hierarchy: Use a predefined order of qualifications to determine the highest level. Customize the list based on your dataset.
    4. Categorize Qualifications: Assign an ordered categorical type that follows the hierarchy so that values can be compared.
    5. Find the Highest Qualification: Identify the maximum qualification and list the employees who hold it.
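
The steps above can be sketched with an ordered pandas categorical. The Air Quality data carries no employee fields, so everything here is hypothetical: the name and qualification columns, the hierarchy in LEVELS, the spelling fixes, and the toy rows.

```python
import pandas as pd

# Hypothetical hierarchy, lowest to highest; adapt it to the labels in your data.
LEVELS = ["High School", "Bachelor's", "Master's", "PhD"]


def highest_qualification(df, column="qualification", levels=LEVELS):
    """Return the top qualification present and the rows that hold it."""
    # Normalize whitespace and a couple of common spelling variants.
    cleaned = df[column].str.strip().replace(
        {"Bachelors": "Bachelor's", "Masters": "Master's"}
    )
    # An ordered categorical makes max() follow the hierarchy, not the alphabet.
    ranked = pd.Series(
        pd.Categorical(cleaned, categories=levels, ordered=True), index=df.index
    )
    top = ranked.max()
    return top, df[ranked == top]


# Synthetic employees for illustration only.
toy = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen", "Dana"],
    "qualification": ["Bachelors", "PhD", " Master's", "PhD"],
})
top, holders = highest_qualification(toy)
print(top)
print(holders)
```

Labels that are not in the hierarchy become missing values and are ignored by max(), so checking the unique values first remains worthwhile.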

6.)

    Classification: Use if the dataset includes labeled data (e.g., predicting educational qualification based on features like age, department, etc.).
    Clustering: Use if the dataset lacks labels and you aim to group employees with similar characteristics.
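
The metadata printed by the notebook lists target_col as None, so clustering is one reasonable choice for the Air Quality data. The sketch below is an assumption-laden illustration rather than a tuned model: it uses scikit-learn's KMeans with k=2, treats -200 as the missing-value sentinel, and runs on synthetic values. The helper name cluster_readings is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def cluster_readings(df, columns, k=2, missing=-200):
    """Group rows into k clusters after dropping rows with missing readings."""
    data = df[columns].replace(missing, np.nan).dropna()
    # Standardize so that columns with large units do not dominate distances.
    scaled = StandardScaler().fit_transform(data)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(scaled)
    return data.assign(cluster=labels)


# Synthetic readings for illustration only: three low and three high hours.
toy = pd.DataFrame({
    "CO(GT)":  [0.5, 0.7, 0.6, 4.8, 5.1, -200, 5.0],
    "NOx(GT)": [40, 55, 50, 400, 420, 300, 410],
})
clustered = cluster_readings(toy, ["CO(GT)", "NOx(GT)"])
print(clustered)
```

On the real data, the number of clusters would need to be chosen with a diagnostic such as the silhouette score.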
