In [11]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, accuracy_score

# Load the dataset
df = pd.read_csv('car_evaluation.csv', header=None)

# Check the shape and info of the dataset
print(df.shape)
print(df.info())

# Set column names
df.columns = ['buying_price', 'maintenance_cost', 'number_of_doors', 'number_of_persons', 'lug_boot', 'safety', 'decision']
print(df.head())
print(df.describe())

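# Encode every column (features and target) as integer codes.
# LabelEncoder assigns codes in alphabetical order of the category names.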
le = LabelEncoder()
for column in df.columns:
    df[column] = le.fit_transform(df[column])

print(df['buying_price'].unique())

print(df.head())

# Split into features and target
X = df.drop('decision', axis=1)  # Features
y = df['decision']  # Target (Encoded)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Random Forest model
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)

# Make predictions
y_pred = rf_classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print("Classification Report:\n", report)


(1728, 7)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       1728 non-null   object
 1   1       1728 non-null   object
 2   2       1728 non-null   object
 3   3       1728 non-null   object
 4   4       1728 non-null   object
 5   5       1728 non-null   object
 6   6       1728 non-null   object
dtypes: object(7)
memory usage: 94.6+ KB
None
  buying_price maintenance_cost number_of_doors number_of_persons lug_boot  \
0        vhigh            vhigh               2                 2    small   
1        vhigh            vhigh               2                 2    small   
2        vhigh            vhigh               2                 2    small   
3        vhigh            vhigh               2                 2      med   
4        vhigh            vhigh               2                 2      med   

  safety decision  
0    low    unacc  
1    

### Ensemble Learning
Ensemble learning is a machine learning technique in which multiple models (often called "base learners", or "weak learners" in the context of boosting) are trained and combined to solve a single task. A group of models working together can often perform better than any individual model because their individual errors partly cancel out, which reduces the variance of the predictions and the risk of overfitting.

#### Types of Ensemble Learning Methods
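1. **Bagging (Bootstrap Aggregating)**: Each model in the ensemble is trained on a bootstrap sample of the data (rows drawn at random with replacement), and the models' predictions are combined by voting or averaging. A popular example is the Random Forest algorithm.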
2. **Boosting**: Models are trained sequentially, with each new model focusing on the errors made by the previous models. Algorithms like AdaBoost and Gradient Boosting fall under this category.
3. **Stacking**: Different types of models are trained, and then their outputs are combined by another model (often called a meta-learner) to make the final prediction.
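
To make the three families concrete, here is a minimal sketch using scikit-learn's `BaggingClassifier`, `GradientBoostingClassifier`, and `StackingClassifier`. The synthetic data from `make_classification` and all hyperparameter values are assumptions chosen only so the example runs on its own; they are not part of this practical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    BaggingClassifier,
    GradientBoostingClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption), used only to keep the example self-contained.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    # Bagging: many trees, each fitted on a bootstrap sample of the rows.
    "bagging": BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=25, random_state=0
    ),
    # Boosting: trees added one after another, each correcting earlier errors.
    "boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
    # Stacking: different model types combined by a meta-learner.
    "stacking": StackingClassifier(
        estimators=[
            ("tree", DecisionTreeClassifier(max_depth=4, random_state=0)),
            ("logreg", LogisticRegression(max_iter=1000)),
        ],
        final_estimator=LogisticRegression(max_iter=1000),
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on held-out rows
    print(f"{name}: {scores[name]:.3f}")
```

The scores on this toy data say nothing about which family is best in general; the point is only that all three share the same `fit` / `score` interface.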

### Random Forest Classifier
The **Random Forest** is an ensemble method that combines multiple decision trees to make predictions. It is a bagging method that can be used for both classification and regression tasks.

#### Key Concepts of Random Forest:
1. **Decision Trees**: Random Forest builds multiple decision trees, each trained on a random subset of the data with replacement (bootstrap sampling).
2. **Random Feature Selection**: At each split within a decision tree, Random Forest only considers a random subset of features, which introduces further randomness and diversity in the trees.
3. **Voting Mechanism**: For classification tasks, each tree in the forest makes a prediction, and the class with the most votes becomes the final prediction. In regression tasks, the average of the trees' predictions is used.
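
The three concepts above can be illustrated with plain NumPy. This is a sketch of the ideas, not of scikit-learn's internals: the row count, the feature count, and the `tree_votes` array are made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 1000

# 1. Bootstrap sampling: draw n_rows row indices with replacement.
bootstrap_idx = rng.integers(0, n_rows, size=n_rows)
unique_fraction = np.unique(bootstrap_idx).size / n_rows
print(f"Unique rows in one bootstrap sample: {unique_fraction:.1%}")

# 2. Random feature selection: choose about sqrt(n_features) candidate
#    features to consider at a single split.
n_features = 6
k = max(1, int(np.sqrt(n_features)))
candidate_features = rng.choice(n_features, size=k, replace=False)
print("Candidate features for this split:", candidate_features)

# 3. Majority vote: each row holds one tree's predicted class per sample.
tree_votes = np.array([
    [2, 0, 2],
    [2, 0, 1],
    [0, 0, 2],
])
majority = np.array([np.bincount(col).argmax() for col in tree_votes.T])
print(majority)  # [2 0 2]
```

Because rows are drawn with replacement, a bootstrap sample contains roughly 63% of the distinct rows on average, so each tree sees a slightly different view of the data.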

### Advantages of Random Forest:
- **High Accuracy**: Due to the ensemble of trees, Random Forest generally has higher accuracy than a single decision tree.
- **Reduced Overfitting**: By averaging multiple trees, Random Forest helps reduce the risk of overfitting, especially when there is a lot of noise in the data.
- **Feature Importance**: Random Forest provides a measure of feature importance, showing which features contribute most to the predictions.
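
As a sketch of how feature importance can be read from a fitted model, the example below uses the `feature_importances_` attribute of `RandomForestClassifier`. To stay runnable without `car_evaluation.csv`, it builds a synthetic grid of all category combinations (the category names are taken from the encoding output in this notebook) and labels it with a hypothetical rule, so the importances shown are not those of the real dataset.

```python
import itertools

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OrdinalEncoder

# Category levels per feature, listed in their natural order.
levels = {
    "buying_price": ["low", "med", "high", "vhigh"],
    "maintenance_cost": ["low", "med", "high", "vhigh"],
    "num_doors": ["2", "3", "4", "5more"],
    "num_persons": ["2", "4", "more"],
    "lug_boot": ["small", "med", "big"],
    "safety": ["low", "med", "high"],
}

# Every combination of levels: 4 * 4 * 4 * 3 * 3 * 3 = 1728 rows.
cars = pd.DataFrame(
    list(itertools.product(*levels.values())), columns=list(levels)
)

# HYPOTHETICAL labelling rule, for illustration only: a car is "unacc"
# when safety is low or it seats only two people, otherwise "acc".
is_unacc = (cars["safety"] == "low") | (cars["num_persons"] == "2")
labels = is_unacc.map({True: "unacc", False: "acc"})

# Encode each feature as an integer that follows the order given above.
encoder = OrdinalEncoder(categories=list(levels.values()))
X = encoder.fit_transform(cars)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, labels)

# Impurity-based importances; they sum to 1 across all features.
importances = pd.Series(model.feature_importances_, index=cars.columns)
print(importances.sort_values(ascending=False))
```

With this toy rule, `safety` and `num_persons` should dominate the ranking, since they are the only features the labels depend on.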

### Example of How Random Forest Works:
1. **Training Phase**:
   - The Random Forest algorithm creates multiple decision trees using different random subsets of the data and random subsets of features at each split.
2. **Prediction Phase**:
   - For a new data point, each tree in the forest makes a prediction.
   - For classification, the Random Forest takes a majority vote across all trees; for regression, it averages the predictions from all trees.
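
The two phases can be sketched by hand with individual `DecisionTreeClassifier` objects. This is a simplified illustration on synthetic data (an assumption), not a replacement for `RandomForestClassifier`. Note that scikit-learn's own forest averages the trees' predicted class probabilities rather than counting hard votes, so its output can differ slightly from the plain majority vote used here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption) so the example is self-contained.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=4, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Training phase: fit each tree on its own bootstrap sample, and let it
# consider only a random subset of features at every split.
rng = np.random.default_rng(0)
trees = []
for i in range(25):
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Prediction phase: collect one vote per tree, then take the majority.
votes = np.stack([tree.predict(X_test) for tree in trees])  # (n_trees, n_test)
forest_pred = np.array([np.bincount(col).argmax() for col in votes.T])

forest_acc = (forest_pred == y_test).mean()
print(f"Hand-built forest accuracy: {forest_acc:.3f}")
```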

In summary, **ensemble learning** combines multiple models to improve performance, while the **Random Forest classifier** is a specific ensemble learning method that builds multiple decision trees and aggregates their results for reliable predictions.

### Purpose of This Practical
The practical you're working on is framed as **predicting the safety of a car**. In the code, the model predicts the `decision` column, which holds the car's overall evaluation (`unacc`, `acc`, `good`, or `vgood`), from six input features:
- **Buying Price**
- **Maintenance Cost**
- **Number of Doors**
- **Number of Persons**
- **Luggage Boot Size**
- **Safety Rating**

The **goal** of this practical is to **train a Random Forest Classifier** to predict the car's evaluation class from these features. This involves:
1. **Understanding and Preprocessing Data**: You load the dataset, encode categorical features, and prepare the data for training.
2. **Model Training and Evaluation**: You train a Random Forest classifier and evaluate how well it predicts the evaluation class on a held-out test set.
3. **Feature Importance**: Understanding which features contribute the most to the prediction. This can be inspected after training, although the code in this notebook does not do so.

### Does It Fulfill the Requirement of Predicting Safety of the Car?

Yes, with one clarification: the model predicts the `decision` column (the car's overall evaluation), and `safety` is one of the inputs rather than the target. With that reading, the practical fulfills the requirement of **predicting the safety of the car**. Here's why:

1. **Data Relevance**:
   - The dataset you're using is directly related to car evaluation, with features like **buying price**, **maintenance cost**, and **safety** all feeding into the overall evaluation of a car.
   - These features are used by the Random Forest model to predict the target variable, `decision`.

2. **Prediction Task**:
   - The task is to predict the `decision` column, which is the dependent variable (target), from the six independent features (input variables), including **safety**. The predicted class states how acceptable a car is given its attributes.

3. **Use of Random Forest**:
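   - Random Forest is a reasonable choice for this type of classification task. It is an ensemble learning method that can model non-linear relationships between features and often achieves high accuracy. Averaging over many trees also reduces overfitting, which helps on a small dataset like this one (1,728 rows).
   - The categorical features (like **buying price** or **number of doors**) are converted to integer codes with `LabelEncoder`. The codes follow the alphabetical order of the category names rather than their natural order (for example `high` = 0, `low` = 1, `med` = 2, `vhigh` = 3). Tree-based models can usually work around this because they split on thresholds, but the codes should not be read as a ranking.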

4. **Model Evaluation**:
   - By splitting the data into training and testing sets and using **accuracy** and the **classification report**, you assess how well the model predicts the car's evaluation class on data it has not seen. The classification report adds per-class precision, recall, and F1-score to the overall accuracy.

### Steps Involved in the Practical:
1. **Data Preprocessing**:
   - Convert categorical variables into numerical values using `LabelEncoder`.
   - Split the data into features (X) and the target variable (y).

2. **Model Training**:
   - Train the **Random Forest classifier** on the training data to learn the relationship between car attributes and the `decision` class.

3. **Model Evaluation**:
   - Use **accuracy** and the **classification report** to evaluate how well the model predicts the `decision` class on unseen data (test set).

4. **Prediction**:
   - Use the trained Random Forest model to predict the `decision` class of cars based on their attributes.

### Conclusion
This practical meets the requirement of predicting the safety of a car, as the task frames it, because:
- It uses relevant features, including the safety rating, to predict the car's evaluation class (`decision`).
- It applies a machine learning model (Random Forest) which is appropriate for multi-class classification tasks like this one.
- The evaluation using accuracy and the classification report measures how well the model performs on held-out data.

So, yes, this practical is designed to **predict the safety of the car**, expressed as its overall evaluation class, based on the provided dataset and fulfills the task requirements.

Here’s an explanation of each step of the process, without the code:

### 1. **Importing Libraries**:
- First, we import the necessary libraries for data manipulation (`pandas`), data splitting (`train_test_split`), machine learning model building (`RandomForestClassifier`), data encoding (`LabelEncoder`), and model evaluation (`classification_report`, `accuracy_score`).

### 2. **Loading the Dataset**:
- The dataset is loaded from a CSV file into a DataFrame, where each row represents a car and each column represents a feature (e.g., buying price, maintenance cost). This step allows us to work with the data in a structured format.

### 3. **Checking the Dataset**:
- We check the dimensions (`shape`) and structure (`info`) of the dataset to understand how many rows and columns it has, and to check the data types and missing values.
- We also take a quick look at the first few rows of data and get summary statistics to understand the data distribution and range of values.

### 4. **Assigning Column Names**:
- Column names are assigned to the dataset to make the data more meaningful. This step is important for easy reference to each feature in the dataset, like 'buying_price', 'maintenance_cost', etc.
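
As an alternative sketch, loading and naming can be done in one call by passing `names=` to `pd.read_csv`. The example reads from an in-memory string instead of `car_evaluation.csv` so that it runs on its own; the first row matches the first row printed in this notebook, while the second row is an illustrative value, not copied from the file.

```python
import io

import pandas as pd

columns = [
    "buying_price", "maintenance_cost", "num_doors",
    "num_persons", "lug_boot", "safety", "decision",
]

# Stand-in for the CSV file: no header row, seven comma-separated fields.
sample_csv = io.StringIO(
    "vhigh,vhigh,2,2,small,low,unacc\n"
    "low,med,4,more,big,high,vgood\n"  # illustrative row
)

# header=None says the file has no header line; names= supplies the columns.
# dtype=str keeps values such as "2" and "5more" as strings in every column.
df = pd.read_csv(sample_csv, header=None, names=columns, dtype=str)

print(df.shape)
print(df.head())
```

With the real file, `sample_csv` would be replaced by the path `'car_evaluation.csv'`.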

### 5. **Encoding Categorical Features**:
- Since scikit-learn models expect numeric input, categorical features (like "high", "low", "med") need to be converted into numeric values. This is done using **Label Encoding**, where each unique category is assigned an integer code in alphabetical order of the category names.
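
One thing to be aware of is that alphabetical codes do not follow the natural order of the categories. The sketch below contrasts `LabelEncoder` with `OrdinalEncoder`, which accepts an explicit category order. Using `OrdinalEncoder` here is a suggested alternative, not what this notebook's code does.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

prices = pd.DataFrame({"buying_price": ["vhigh", "high", "med", "low"]})

# LabelEncoder sorts the categories alphabetically:
# high -> 0, low -> 1, med -> 2, vhigh -> 3
alphabetical = LabelEncoder().fit_transform(prices["buying_price"])
print(alphabetical)  # [3 0 2 1]

# OrdinalEncoder with an explicit order: low -> 0, med -> 1, high -> 2, vhigh -> 3
ordered_encoder = OrdinalEncoder(categories=[["low", "med", "high", "vhigh"]])
ordered = ordered_encoder.fit_transform(prices).ravel()
print(ordered)  # [3. 2. 1. 0.]
```

Random Forests often cope with either encoding, but ordered codes let a single threshold split separate "cheap" from "expensive" cars.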

### 6. **Splitting the Data into Features and Target**:
- We separate the dataset into two parts: features (X), which are the attributes used to predict the outcome (e.g., buying price, number of doors, safety), and the target (y), which is the column we are trying to predict (in this case, the `decision` column, the car's overall evaluation).

### 7. **Train-Test Split**:
- The data is divided into two sets: one for training the model and one for testing it. Here, 80% of the data is used for training and 20% for testing (`test_size=0.2`), and `random_state=42` makes the split reproducible. This allows us to evaluate how well the model generalizes to unseen data.
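
Because the classes in this dataset are imbalanced (the classification report shows supports ranging from 11 to 235), it can help to pass `stratify=y` so both sets keep the same class proportions. This is an optional variation on the notebook's split; the label counts below are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 1000 rows across four classes.
y = np.array([0] * 700 + [1] * 220 + [2] * 40 + [3] * 40)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Each class contributes 20% of its rows to the test set.
print(np.bincount(y_train))  # [560 176  32  32]
print(np.bincount(y_test))   # [140  44   8   8]
```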

### 8. **Training the Random Forest Model**:
- A **Random Forest Classifier** is initialized and trained using the training data (features and target). A Random Forest is a collection of decision trees that work together to make predictions. It’s known for its accuracy and ability to handle a variety of data types.

### 9. **Making Predictions**:
- After the model is trained, we use it to make predictions on the test data (which the model hasn't seen before). This step is where the model predicts the `decision` class for each car in the test set.
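
The predictions come back as integer codes. A minimal sketch of turning them back into class names, assuming a dedicated `LabelEncoder` is kept for the target (the notebook instead reuses one encoder for every column), looks like this. The `y_pred` values are made up.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# A separate encoder for the target keeps its mapping available later.
target_encoder = LabelEncoder()
y = target_encoder.fit_transform(["unacc", "acc", "good", "vgood", "unacc"])
print(y)  # [2 0 1 3 2]

# Hypothetical model output, decoded back to the original class names.
y_pred = np.array([2, 0, 3])
decoded = target_encoder.inverse_transform(y_pred)
print(decoded)  # ['unacc' 'acc' 'vgood']
```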

### 10. **Evaluating the Model**:
- The performance of the model is evaluated by comparing its predictions to the actual target values from the test data.
  - **Accuracy** is calculated to see how many predictions were correct.
  - A **classification report** is generated to give a more detailed evaluation, including precision, recall, F1-score, and support. These metrics help assess how well the model performs on each of the four classes (`acc`, `good`, `unacc`, `vgood`).
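
To show how these metrics are read, here is a small sketch with hand-written labels (not the notebook's results). It adds `confusion_matrix` and passes `target_names` so the report shows class names instead of codes; the name order assumes the alphabetical mapping printed by the encoder.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Order matches the encoder mapping: acc=0, good=1, unacc=2, vgood=3.
class_names = ["acc", "good", "unacc", "vgood"]

# Made-up true and predicted codes for ten cars.
y_true = [0, 0, 0, 1, 2, 2, 2, 2, 3, 3]
y_pred = [0, 0, 1, 1, 2, 2, 2, 2, 3, 0]

accuracy = accuracy_score(y_true, y_pred)
print(accuracy)  # 0.8

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_true, y_pred)
print(matrix)

print(classification_report(y_true, y_pred, target_names=class_names))
```

In this toy case, `good` has perfect recall but only 0.5 precision, because one `acc` car was predicted as `good`. The same pattern (high recall, lower precision on a small class) is what a precision of 0.65 with recall of 1.00 indicates in a real report.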

### 11. **Printing the Results**:
- The results of the model evaluation, including accuracy and the detailed classification report, are printed to show how well the model predicts the car's evaluation class.

---

### Purpose of This Approach:
The purpose of this approach is to predict a car's evaluation class (`decision`) from its attributes (e.g., buying price, maintenance cost, number of doors, safety rating). By encoding the data, splitting it into features and target, training a Random Forest model, and evaluating it on held-out data, we can measure how accurately the model assigns cars to the four classes (`unacc`, `acc`, `good`, `vgood`). The final goal is to use the model's predictions to classify cars based on historical data, allowing for better decision-making.

In [9]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, accuracy_score

# Load the dataset
df = pd.read_csv('car_evaluation.csv', header=None)
print("Dataset loaded successfully.")

# Set column names
df.columns = ['buying_price', 'maintenance_cost', 'num_doors', 'num_persons', 'lug_boot', 'safety', 'decision']
print("Column names set.")

# Initialize LabelEncoder and create a dictionary to store mappings
le = LabelEncoder()
label_mappings = {}

# Encode categorical features and store the mappings
print("\nEncoding categorical features and storing mappings:")
for column in df.columns:
    df[column] = le.fit_transform(df[column])
    # Store label mapping for the column
    label_mappings[column] = dict(zip(le.classes_, le.transform(le.classes_)))
    # Print mappings for each column
    print(f"\nColumn: {column}")
    for category, encoded_value in label_mappings[column].items():
        print(f"  {category}: {encoded_value}")

# Define features and target
X = df.drop('decision', axis=1)  # Features
y = df['decision']  # Target
print("\nFeatures and target variable separated.")

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Data split into training and testing sets.")

# Initialize and train the Random Forest model
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
print("Training the Random Forest model...")
rf_classifier.fit(X_train, y_train)
print("Model training completed.")

# Make predictions
y_pred = rf_classifier.predict(X_test)
print("Predictions made on the test set.")

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"\nModel Accuracy: {accuracy:.2f}")
print("\nClassification Report:\n", report)


Dataset loaded successfully.
Column names set.

Encoding categorical features and storing mappings:

Column: buying_price
  high: 0
  low: 1
  med: 2
  vhigh: 3

Column: maintenance_cost
  high: 0
  low: 1
  med: 2
  vhigh: 3

Column: num_doors
  2: 0
  3: 1
  4: 2
  5more: 3

Column: num_persons
  2: 0
  4: 1
  more: 2

Column: lug_boot
  big: 0
  med: 1
  small: 2

Column: safety
  high: 0
  low: 1
  med: 2

Column: decision
  acc: 0
  good: 1
  unacc: 2
  vgood: 3

Features and target variable separated.
Data split into training and testing sets.
Training the Random Forest model...
Model training completed.
Predictions made on the test set.

Model Accuracy: 0.97

Classification Report:
               precision    recall  f1-score   support

           0       0.99      0.90      0.94        83
           1       0.65      1.00      0.79        11
           2       0.99      1.00      1.00       235
           3       1.00      0.94      0.97        17

    accuracy                 