### Regis University

**MSDS688_X70: Artificial Intelligence**  
Master of Science in Data Science Program


#### Week 1: Rule-Based Systems for AI Agents

## Lecture: Week 1 - Introduction to Rule-Based Systems and Machine Learning Models

### Overview

Before diving into the machine learning models we'll be using in this assignment, it's important to understand where these models fit in the broader context of AI. One of the earliest approaches to AI was the **Rule-Based System**.

---

### Rule-Based Systems in AI

**Rule-Based Systems** are a form of AI where knowledge is encoded in the form of rules, often written as **if-then** statements. These systems rely on a predefined set of rules to make decisions or classify data.

#### How Rule-Based Systems Work:
- A set of **rules** is created manually by domain experts.
- Each rule follows the format:  
  `IF condition THEN action`.
- For example, a rule for a medical diagnosis system might look like this:  
  `IF fever > 101°F AND cough = true THEN diagnosis = flu`.
- The system checks each rule against the input data, and when a rule matches, the corresponding action or classification is triggered.
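
The matching loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production expert system: the `diagnose` function, the `fever_f` and `cough` keys, and the second rule are hypothetical names introduced here; only the flu rule comes from the example above. Rules are checked in order and the first match wins, which is one common conflict-resolution strategy among several.

```python
# Each rule pairs a condition (a predicate over the input facts) with an action.
# The first rule mirrors the flu example; the second rule is hypothetical.
RULES = [
    (lambda f: f["fever_f"] > 101 and f["cough"], "flu"),
    (lambda f: f["fever_f"] > 101, "unspecified fever"),  # hypothetical rule
]


def diagnose(facts, rules=RULES, default="no diagnosis"):
    """Return the action of the first rule whose condition matches the facts."""
    for condition, action in rules:
        if condition(facts):
            return action
    return default  # no rule fired


print(diagnose({"fever_f": 102.5, "cough": True}))   # flu
print(diagnose({"fever_f": 102.5, "cough": False}))  # unspecified fever
print(diagnose({"fever_f": 99.0, "cough": True}))    # no diagnosis
```

Note that any input the rules do not anticipate falls through to the default, which illustrates the adaptability limitation of rule-based systems.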

#### Advantages of Rule-Based Systems:
- **Interpretability**: Rules are easy to understand and explain.
- **Control**: The system's behavior is fully determined by the rules, giving precise control over outcomes.
  
However, Rule-Based Systems have some significant **limitations**:
- **Scalability**: Manually creating and managing a large number of rules is difficult.
- **Adaptability**: Rule-based systems cannot easily adapt to new data or situations they weren't programmed for.
  
#### Transition to Machine Learning
Machine learning models, in contrast to rule-based systems, do not rely on manually defined rules. Instead, they **learn patterns** from data. These models can generalize to new data and situations, making them more adaptable and scalable for complex tasks like classification, regression, and more.

---

### Introduction to Machine Learning Models

In this assignment, we are focusing on training and evaluating three different machine learning models:
1. **Random Forest**
2. **Logistic Regression**
3. **Support Vector Machines (SVM)**

The objective is to understand how these models work, how they differ, and how to evaluate their performance.

We'll start by reviewing the key concepts behind each model, followed by an explanation of how they make predictions, and then wrap up with a discussion of model evaluation techniques, specifically accuracy.

---

### 1. **Random Forest**

Random Forest is an ensemble learning method, which means it combines multiple models (in this case, decision trees) to improve performance. The core idea behind Random Forest is **bagging** (bootstrap aggregating), where multiple decision trees are trained on random subsets of the data, and their predictions are combined (by voting or averaging) to give the final result.

#### How Random Forest Works:
1. **Random Sampling**: The algorithm creates multiple decision trees by randomly sampling data points (with replacement) from the training dataset.
2. **Random Feature Selection**: At each split, a tree considers only a random subset of the features, making each tree slightly different.
3. **Prediction**: For classification tasks, the majority vote from all the trees is taken as the final prediction. For regression tasks, the average prediction is used.
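
The sampling and voting steps above can be sketched with the standard library alone. This is only an illustration of bootstrap sampling, feature subsetting, and majority voting; it does not train any trees. The helper names (`bootstrap_sample`, `majority_vote`), the feature names, and the list of votes are assumptions made for the example.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Step 1: draw len(rows) rows at random, with replacement."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Step 3 (classification): return the most common predicted label."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)   # seeded so the sketch is repeatable
rows = list(range(10))   # stand-in for ten training rows

sample = bootstrap_sample(rows, rng)  # some rows repeat, others are left out

# Step 2: pick a random subset of (hypothetical) feature names.
feature_subset = rng.sample(["temperature", "humidity", "wind_speed"], k=2)

votes = [1, 0, 1, 1, 0]  # hypothetical predictions from five trees
print(len(sample), len(feature_subset), majority_vote(votes))  # 10 2 1
```

For regression, the last step would average the trees' numeric predictions instead of counting votes.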

#### Diagram: Random Forest

```
                Training Data
                      |
                ----------------------
                |        |        |    |
              Tree 1   Tree 2   Tree 3  ...
                |        |        |  
      Prediction 1  Prediction 2  Prediction 3
                |        |        |
            Majority Vote / Averaging
                      |
                Final Prediction
```

#### Advantages of Random Forest:
- Handles both classification and regression problems.
- Reduces overfitting due to the averaging of multiple trees.
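- Can cope with missing data in some implementations; others, including older versions of scikit-learn, require missing values to be imputed first.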

#### Key Parameters:
- `n_estimators`: The number of trees in the forest.
- `max_depth`: The maximum depth of each tree.
- `max_features`: The number of features considered for splitting at each node.

---

### 2. **Logistic Regression**

Despite its name, Logistic Regression is a classification algorithm, not a regression algorithm. It is used when the dependent variable is categorical; in its basic (binary) form, the output belongs to one of two classes.

#### How Logistic Regression Works:
1. **Linear Combination**: Logistic regression takes a linear combination of input features:  
   \( z = w_0 + w_1x_1 + w_2x_2 + ... + w_nx_n \)
2. **Sigmoid Function**: It then applies the **sigmoid function** to the result to obtain a probability score between 0 and 1.  
   \( \sigma(z) = \frac{1}{1 + e^{-z}} \)
3. **Decision Boundary**: A threshold (typically 0.5) is applied to classify the outcome as either class 0 or class 1.
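
The three steps above translate almost directly into code. The sketch below is a hand-rolled illustration, not scikit-learn's implementation: the function names and the weights, bias, and inputs are hypothetical values chosen so the arithmetic is easy to follow, and no training takes place.

```python
import math


def sigmoid(z):
    """Map any real number to (0, 1); written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, x):
    """Steps 1 and 2: linear combination z = w0 + sum(w_i * x_i), then sigmoid."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)


def predict(weights, bias, x, threshold=0.5):
    """Step 3: class 1 if the probability exceeds the threshold, else class 0."""
    return 1 if predict_proba(weights, bias, x) > threshold else 0


weights, bias = [2.0, -1.0], -1.0  # hypothetical, not learned from data

print(predict_proba(weights, bias, [1.0, 1.0]))  # z = 0, so P = 0.5
print(predict(weights, bias, [1.0, 1.0]))        # P is not > 0.5, so class 0
print(predict(weights, bias, [2.0, 1.0]))        # z = 2, so class 1
```

Lowering `threshold` makes the model predict class 1 more readily, which trades precision for recall.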

#### Diagram: Logistic Regression

```
               Features (X1, X2, ..., Xn)
                     |
        Linear Combination: z = w0 + w1*X1 + w2*X2 + ...
                     |
              Sigmoid Function: σ(z) = 1 / (1 + e^(-z))
                     |
             Probability: 0 ≤ P ≤ 1
                     |
            Threshold (0.5):  P > 0.5 → Class 1
                             P ≤ 0.5 → Class 0
```

#### Advantages of Logistic Regression:
- Simple and fast to train.
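- Interpretable model (each coefficient gives the change in the log-odds of the positive class per unit increase in the feature; magnitudes are comparable across features only when the features are on the same scale).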
- Works well for binary classification problems.

#### Key Parameters:
- `penalty`: Regularization term to prevent overfitting (L1, L2).
- `C`: Inverse of regularization strength (smaller values specify stronger regularization).

---

### 3. **Support Vector Machines (SVM)**

Support Vector Machines are powerful classification algorithms that aim to find the **best separating hyperplane** between different classes. The idea is to maximize the **margin** between the two classes.

#### How SVM Works:
1. **Hyperplane**: SVM finds the hyperplane that best separates the classes. The points that lie closest to the hyperplane are called **support vectors**.
2. **Maximizing the Margin**: The goal is to maximize the distance between the support vectors of each class and the hyperplane.
3. **Kernel Trick**: For non-linearly separable data, SVM can use a kernel function (e.g., polynomial, RBF) to implicitly map the inputs into a higher-dimensional space in which the classes can become linearly separable.
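
To make these ideas concrete, the sketch below evaluates a linear decision function, the margin width, and an RBF kernel by hand. It assumes the hyperplane parameters `w` and `b` are already known (here they are hypothetical, not learned) and that the hyperplane is in canonical form, where the margin boundaries are \( w \cdot x + b = \pm 1 \) and the margin width is \( 2 / \lVert w \rVert \).

```python
import math


def decision_function(w, b, x):
    """Signed score w.x + b; its sign tells which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Class 1 on the non-negative side of the hyperplane, class 0 otherwise."""
    return 1 if decision_function(w, b, x) >= 0 else 0


def margin_width(w):
    """Distance between the boundaries w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


def rbf_kernel(x, y, gamma):
    """RBF similarity exp(-gamma * ||x - y||^2): 1 for identical points, near 0 far apart."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


w, b = [3.0, 4.0], -5.0  # hypothetical hyperplane, ||w|| = 5

print(classify(w, b, [3.0, 4.0]))  # score 20 -> class 1
print(classify(w, b, [0.0, 0.0]))  # score -5 -> class 0
print(rbf_kernel([0.0, 0.0], [0.0, 0.0], gamma=1.0))  # 1.0
```

A larger `gamma` makes the RBF similarity fall off faster with distance, so each training point influences a smaller neighborhood.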

#### Diagram: SVM

```
        +                                +            ← Class 1
                      (Support Vector)      
        +                    |                   +
              ----------- Hyperplane ------------
                              |
         (Support Vector)                ← Class 0
        o                                o        
        o          o                     o
```

#### Advantages of SVM:
- Works well for high-dimensional data.
- Effective in cases where the number of features is greater than the number of samples.
- Can be used for both classification and regression.

#### Key Parameters:
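- `C`: Regularization parameter; controls the trade-off between achieving a low error on the training data and maximizing the margin (larger values penalize misclassified training points more heavily and tend to give a narrower margin).
- `kernel`: The kernel function used to measure similarity between points (linear, RBF, polynomial), which determines the shape of the decision boundary.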
- `gamma`: Kernel coefficient for RBF, polynomial, and sigmoid kernels.

---

### Model Evaluation

For this assignment, we will evaluate the models using **accuracy**, which is a common metric for classification tasks. Accuracy is the ratio of correctly predicted instances to the total number of instances.

Accuracy = (Number of Correct Predictions) / (Total Number of Predictions)

While accuracy is useful, it may not be ideal when dealing with imbalanced datasets (where one class has significantly more instances than the other). In such cases, other metrics like precision, recall, and F1-score are more informative.
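
The following sketch shows why accuracy can mislead on imbalanced data. It computes accuracy, precision, recall, and F1 from scratch for a hypothetical classifier that always predicts the majority class; the labels are made up for illustration, and the helper names are assumptions rather than library functions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 (0.0 when a ratio is undefined)."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_true = [0] * 9 + [1]  # hypothetical labels: nine negatives, one positive
y_pred = [0] * 10       # a "model" that always predicts the majority class

print(metrics(y_true, y_pred))
# {'accuracy': 0.9, 'precision': 0.0, 'recall': 0.0, 'f1': 0.0}
```

The classifier scores 90% accuracy while never detecting a single positive case, which is exactly the situation that recall and F1 expose.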

---

### Conclusion

In this assignment, you will train and evaluate Random Forest, Logistic Regression, and SVM models on a dataset. By comparing their accuracy, you will gain insight into how different machine learning algorithms perform on the same problem and understand their strengths and weaknesses.

Remember, the key to mastering these models lies in experimenting with the hyperparameters and understanding how each model works under different conditions. Pay close attention to the differences in performance, especially when tweaking parameters such as `n_estimators`, `C`, and `kernel`.

---


## Assignment Part 1: Follow Me – Umbrella Decision-Making System

In this section, you will follow along to build a rule-based and data-driven AI system that helps make a decision on whether to carry an umbrella. You’ll explore how different conditions, such as weather and personal preferences, can influence an AI agent’s decision-making process.


In [1]:
!pip install simpleai


In [2]:
from simpleai.search import CspProblem, backtrack

In [3]:
# Define the variables and domains
variables = ['weather']
domains = {
    'weather': ['sunny', 'cloudy', 'rainy', 'windy']
}

In [4]:
# Define the rule constraints for the umbrella decision-making process
def umbrella_decision(variables, values):
    weather = values[0]  # Extract the weather condition from input
    if weather == 'rainy':  # If it's rainy
        return 'Carry an umbrella'
    elif weather == 'cloudy':  # If it's cloudy
        return 'Carry an umbrella'
    else:  # For all other weather (e.g., sunny, windy)
        return 'No umbrella needed'

In [5]:
# Create the CSP (Constraint Satisfaction Problem)
# We pass the variables, domains, and the umbrella decision rules to create the problem.
constraints = [(('weather',), umbrella_decision)]
problem = CspProblem(variables, domains, constraints)

In [6]:
# Test the rule-based agent by checking different weather conditions
weather_conditions = ['sunny', 'cloudy', 'rainy', 'windy']  # Example weather conditions
for weather in weather_conditions:
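    # Run backtracking search on the CSP. Note: the search result does not depend on
    # `weather` and is not used below; the printed decision comes from calling
    # umbrella_decision directly.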
    solution = backtrack(problem, inference=True, variable_heuristic='mrv', value_heuristic='least_constraining_value')
    print(f"Weather: {weather} -> Decision: {umbrella_decision(variables, [weather])}")

Weather: sunny -> Decision: No umbrella needed
Weather: cloudy -> Decision: Carry an umbrella
Weather: rainy -> Decision: Carry an umbrella
Weather: windy -> Decision: No umbrella needed


### Step 2: Data-driven decision-making using machine learning

In this section, we will use a data-driven approach to decide whether to carry an umbrella based on weather data.

In [7]:
# Import necessary libraries for machine learning
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

In [8]:
# Create a sample dataset
data = {
    "temperature": [30, 22, 25, 10, 15, 18, 29, 35, 20, 12],
    "humidity": [70, 65, 80, 90, 75, 50, 55, 60, 85, 95],
    "wind_speed": [10, 5, 12, 7, 8, 6, 15, 20, 9, 4],
    "weather": ["rainy", "cloudy", "sunny", "rainy", "cloudy", "sunny", "cloudy", "sunny", "rainy", "rainy"],
    "umbrella": [1, 1, 0, 1, 1, 0, 1, 0, 1, 1]  # 1 = Carry an umbrella, 0 = No umbrella needed
}

In [9]:
# Create a DataFrame from the data
df = pd.DataFrame(data)

In [10]:
# Encode the categorical data
df_encoded = pd.get_dummies(df, columns=['weather'])
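
To show what this encoding step produces, here is a pure-Python sketch of one-hot encoding. It is an illustration only, not pandas' implementation: the `one_hot` helper is hypothetical, and it mimics the `<column>_<category>` column naming that `pd.get_dummies` uses by default. It emits 0/1 integers, whereas recent pandas versions return boolean columns.

```python
def one_hot(values, prefix):
    """Turn a list of category labels into one 0/1 indicator column per category."""
    categories = sorted(set(values))
    return [
        {f"{prefix}_{category}": int(value == category) for category in categories}
        for value in values
    ]


rows = one_hot(["rainy", "cloudy", "sunny", "rainy"], prefix="weather")
print(rows[0])  # {'weather_cloudy': 0, 'weather_rainy': 1, 'weather_sunny': 0}
print(rows[2])  # {'weather_cloudy': 0, 'weather_rainy': 0, 'weather_sunny': 1}
```

Each row has exactly one indicator set to 1, so the model sees the category without any artificial ordering between the labels.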

In [11]:
# Split the dataset into features (X) and target (y)
X = df_encoded.drop('umbrella', axis=1)
y = df_encoded['umbrella']

In [12]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [13]:
# Initialize and train the model
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None,
                       min_samples_split=2, min_samples_leaf=1,
                       min_weight_fraction_leaf=0.0, max_features=None,
                       random_state=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0)


In [14]:
# Predict and evaluate the model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Decision Tree Accuracy: {accuracy:.2f}")

Decision Tree Accuracy: 1.00


## Assignment Part 2: Your Turn – Traffic Control System

In this section, you will implement a machine learning model designed for traffic control. Using the provided datasets and relevant features, you’ll develop a system that predicts and controls traffic flow, simulating real-world decision-making for traffic management systems. **A framework has been provided, and your job is to complete the TODOs.**

In [15]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
import random


In [16]:
# Provided dataset

# Updated to provide correlation for ML #

# Define the possible values for each feature
times_of_day = ['morning', 'rush_hour', 'afternoon', 'night']
traffic_volumes = ['low', 'medium', 'high']
weather_conditions = ['clear', 'rainy', 'foggy']
pedestrian_traffic_levels = ['low', 'medium', 'high']

# Generate 10,000 samples of synthetic traffic data
num_samples = 10000
data = {
    'time_of_day': [],
    'traffic_volume': [],
    'weather': [],
    'pedestrian_traffic': [],
    'camera_input': [],
    'emergency_detection': [],
    'accidents': []
}

for _ in range(num_samples):
    # Generate feature values
    time_of_day = random.choice(times_of_day)
    traffic_volume = random.choices(
        traffic_volumes,
        weights=[1, 3, 5] if time_of_day == 'rush_hour' else [5, 3, 1]
    )[0]  # High traffic more likely during rush hour
    weather = random.choices(
        weather_conditions,
        weights=[7, 2, 1] if time_of_day == 'morning' else [5, 3, 2]
    )[0]  # Clear weather more likely in the morning
    pedestrian_traffic = random.choices(
        pedestrian_traffic_levels,
        weights=[3, 4, 2] if time_of_day in ['rush_hour', 'afternoon'] else [5, 3, 1]
    )[0]  # More pedestrians during rush hour and afternoon
    camera_input = random.choices([0, 1], weights=[8, 2])[0]
    emergency_detection = random.choices([0, 1], weights=[9, 1])[0]

    # Determine accident likelihood
    accident_probability = 0.05  # Base probability
    if traffic_volume == 'high' and weather == 'foggy':
        accident_probability += 0.3  # Higher chance in high traffic and fog
    if pedestrian_traffic == 'high' or time_of_day == 'rush_hour':
        accident_probability += 0.1  # Higher chance with high pedestrian traffic or during rush hour
    if camera_input == 1:
        accident_probability += 0.1  # Issues detected increase chance

    # Generate the target variable
    accident = 1 if random.random() < accident_probability else 0

    # Append the generated values to the dataset
    data['time_of_day'].append(time_of_day)
    data['traffic_volume'].append(traffic_volume)
    data['weather'].append(weather)
    data['pedestrian_traffic'].append(pedestrian_traffic)
    data['camera_input'].append(camera_input)
    data['emergency_detection'].append(emergency_detection)
    data['accidents'].append(accident)


In [17]:
# Create a DataFrame from the data
df = pd.DataFrame(data)

In [18]:
# Encode the categorical data
df_encoded = pd.get_dummies(df, columns=['time_of_day', 'traffic_volume', 'weather', 'pedestrian_traffic'])

In [19]:
# Split the dataset into features (X) and target (y)
X = df_encoded.drop('accidents', axis=1)
y = df_encoded['accidents']

In [20]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [21]:
# Initialize and train the models
# 1. Random Forest
rf_model = RandomForestClassifier()

#### TODO: Customize RandomForestClassifier parameters (e.g., n_estimators, max_depth) ####

rf_model.fit(X_train, y_train)


RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=None,
                       min_samples_split=2, min_samples_leaf=1,
                       min_weight_fraction_leaf=0.0, max_features='sqrt',
                       max_leaf_nodes=None, min_impurity_decrease=0.0,
                       bootstrap=True)


In [22]:
# 2. Logistic Regression
lr_model = LogisticRegression()

#### TODO: Customize LogisticRegression parameters (e.g., penalty, solver) ####

lr_model.fit(X_train, y_train)

LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0,
                   fit_intercept=True, intercept_scaling=1, class_weight=None,
                   random_state=None, solver='lbfgs', max_iter=100)


In [23]:
# 3. Support Vector Machine (SVM)
svm_model = SVC()

#### TODO: Customize SVC parameters (e.g., kernel, C) ####

svm_model.fit(X_train, y_train)

SVC(C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, shrinking=True,
    probability=False, tol=0.001, cache_size=200, class_weight=None)


### Step 2: Predict with the models

In [24]:
# Define the prediction data based on the structure of X_train
# We'll create a DataFrame with zero values and set the specific prediction values
prediction_data = pd.DataFrame(0, index=[0], columns=X_train.columns)

In [25]:
# Set the specific values for the prediction scenario
prediction_data['time_of_day_afternoon'] = 1
prediction_data['traffic_volume_high'] = 1
prediction_data['weather_rainy'] = 1
prediction_data['pedestrian_traffic_low'] = 1
prediction_data['camera_input'] = 1
prediction_data['emergency_detection'] = 0

In [26]:
# Predict using Random Forest
rf_prediction = rf_model.predict(prediction_data)
print(f"Random Forest Prediction: {rf_prediction[0]}")

Random Forest Prediction: 0


In [27]:
# Predict using Logistic Regression
lr_prediction = lr_model.predict(prediction_data)
print(f"Logistic Regression Prediction: {lr_prediction[0]}")

Logistic Regression Prediction: 0


In [28]:
# Predict using Support Vector Machine (SVM)
svm_prediction = svm_model.predict(prediction_data)
print(f"SVM Prediction: {svm_prediction[0]}")

SVM Prediction: 0


In [29]:
# Evaluate the models
# Random Forest
rf_test_predictions = rf_model.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_test_predictions)
print(f"Random Forest Accuracy: {rf_accuracy:.2f}")

Random Forest Accuracy: 0.87


In [30]:
# Logistic Regression
lr_test_predictions = lr_model.predict(X_test)
lr_accuracy = accuracy_score(y_test, lr_test_predictions)
print(f"Logistic Regression Accuracy: {lr_accuracy:.2f}")

Logistic Regression Accuracy: 0.88


In [31]:
# Support Vector Machine (SVM)
svm_test_predictions = svm_model.predict(X_test)
svm_accuracy = accuracy_score(y_test, svm_test_predictions)
print(f"SVM Accuracy: {svm_accuracy:.2f}")

SVM Accuracy: 0.88


### TODO: Interpreting Model Decisions

Now that you have trained and evaluated multiple machine learning models, analyze their decisions by addressing the following questions:

1. **Model Decisions**: What decision did each model make? Were the predictions the same across all models, or did they differ?
2. **Comparison**: If the models made different predictions, what factors could explain these differences? Consider aspects such as feature importance, model type, and training data.
3. **Confidence and Performance**: Which model do you trust the most? Why? Support your answer with relevant metrics (e.g., accuracy, precision, recall, AUC) and observations from the data.
4. **Real-World Implications**: If this were a real-world scenario, how would you justify your chosen model’s decision to stakeholders?

**Action:** Write a short analysis (at least 3-5 sentences) summarizing your observations in a markdown cell below.
