#1. What is a Decision Tree, and how does it work in the context of classification?

#Ans. A decision tree is a type of supervised learning algorithm that uses a tree-like structure to classify data.
  # Here's how it works:
  1. Root Node: The tree starts with a root node,representing the entire dataset.
  2. Splitting: The algorithm splits the data into subsets based on a feature or attribute.
  3. Decision Nodes: Each internal node represents a decision or split,based on a specific feature or attribute.
  4. Leaf Nodes: The terminal nodes,or leaf nodes,represent the predicted class labels.
  5. Classification: New,unseen data is classified by traversing the tree from the root node to a leaf node,based on the feature values.  

  The Decision Tree algorithm recursively partitons the data,creating a hierarchical structure that captures the relationships between features and class labels.

#2. Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

#Ans. Gini Impurity and Entropy are two common impurity measures used in decision trees to determine the best splits.
 # Gini Impurity
 1. Measures the probability of misclassifying a sample: Gini impurity calculates the probability of incorrectly classifying a randomly chosen sample from a node,if it were randomly labeled according to the class distribution of the node.
 2. Formula: where p_i is the proportion of class i in the node.
 3. Range: 0(pure node) to 1 (impure node).

# Entropy
1. Measures the uncertainity or randomness: Entropy calculates the amount of uncertainity or randomness in the class distribution of a node.
2. Formula: where p_i is the proportion of class i in the node.
3. Range: 0(pure node) to log2(k) (impure node), where k is the number of classes.

# Impact on splits
Both Gini Impurity and Entropy are used to evaluate the quality of splits in a decision tree. The goals is to find the split that:

1. Reduced Impurity: Decreases the Gini Impurity or Entropy of the child nodes compared to the parent node.
2. Increases Purity: Increases the proportion of samples of a single class in each child node.

The decision tree algorithm choses the split that results in the largest reduction in impurity,which leads to more accurate classification.

#3. What is the difference between Pre-Pruning and Post-Pruning in Decision trees? Give one practical advantage of using each.

#Ans. Pre-Pruning(Early Stopping):
1. Stops growing the tree before it reaches its maximum depth: Based on certain criteria,such as maximum depth,minimum number of samples per node, or minimum impurity decreases.
2. Practical advantage: Reduces computational cost and prevents overfitting by stopping the growth of the tree early.


# Post-Pruning(Cost-Complexity Pruning):
1. Removes branches from a fully grown tree: Based on a cost complexity metric,whichbalances the tree's complexity and accuracy.
2. Practical advantage: Allows for a more optimal pruning strategy,as it considers the entire tree structure and can remove branches that don't contribute significancy to the model's accuracy.

In summary,pre-pruning is faster and more efficient,while post-pruning can lead to more accurate models by considering the entire tree structure.

#4. What is Information Gain in Decision Trees,and why is it important for choosing the best split?

#Ans. Information Gain in Decision Trees
 Information Gain measures the reduction in impurity or uncertainity in the target variable after splitting the data based on a particular feature.
   # It's calculated as:
   Information Gain = (Impurity of parent node)-(Weighted average of impurity of child nodes)

   # Important for Choosing the Best Split
   Information Gain is important because it helps decision trees:

   1. Identify the most informative features: Features with high information Gain are more useful for splitting the data and reducing uncertainity.
   2. Choose the best split: The feature with the highest information gain is selected as the best split,resulting in a more accurate and efficient decision tree.

   By maximizing Information Gain,decision trees can effectively partiton the data and make accurate predictions.


#5. What are some common real-world applications of Decision Trees,and what are their main advantages and limitations?

#Ans. Real-World Applications of Decision Trees:
1. Credit Risk Assessment: Decision trees are used to evaluate creditworthiness and predict loan defaults.
2. Medical Diagnosis: Decision trees help diagnosis diseases based on symptoms and patient data.
3. Customer Segmentation: Decision trees segment customers based on demographics,behaviour,and preferences.
4. Predictive Maintenance: Decision trees predict equipment failures and schedule maintenance.
5. Fraud Detection: Decision trees detect suspicious transactions and prevent financial losses.

# Main Advantages:
1. Interpretability: Decision trees are easy to understand and visualize.
2. Handle categorical variables: Decision trees can handle categorical variables directly.
3. Fast training: Decision trees are relatively fast to train.

# Limitations:
1. Overfitting: Decision trees can overfit the training data,especially if they're too deep.
2. Instability: Small changes in the data can result in significantly different trees.
3. Limited handling of complex relationships: Decision trees can struggle with complex relationships between variables.

Despite these limitations,decision trees remain a popular and effective tool for many applications especially when combined with ensemble methods like Random Forests.

# Dataset Info:
Iris Dataset for classification tasks(sklearn.datasets.load.iris() or provided CSV).

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()

In [None]:
print (iris.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
            - Iris-Setosa
            - Iris-Versicolour
            - Iris-Virginica

:Summary Statistics:

                Min  Max   Mean    SD   Class Correlation
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fis

In [None]:
df = pd.DataFrame(iris.data,columns=iris.feature_names)
df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


#6. Write a python program to:
.Load the iris Dataset.
.Train a Decision Tree Classifier using the Gini criterion.
.Print the model's accuracy and feature importances.

# Load the iris dataset.

In [None]:
from sklearn.datasets import load_iris
import pandas as pd

In [None]:
# Load the iris dataset
iris=load_iris()

In [None]:
# Create a DataFrame
iris_df = pd.DataFrame(data=iris.data,columns=iris.feature_names)
iris_df['target'] = iris.target
iris_df['species'] = iris_df['target'].apply(lambda x: iris.target_names[x])

In [None]:
print(iris_df.head())

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   target species  
0       0  setosa  
1       0  setosa  
2       0  setosa  
3       0  setosa  
4       0  setosa  


# Train a Decision Tree Classifier using the gini criterion.

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score,classification_report

In [None]:
# Load the iris dataset
iris = load_iris()

In [None]:
# Split the dataset into features (X) and target (Y)
X = iris.data
Y = iris.target

In [None]:
# Split the data into training and testing sets
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.2,random_state=42)

In [None]:
# Create a Decision Tree Classifier using the Gini criterion
df = DecisionTreeClassifier(criterion='gini',random_state=42)

In [None]:
# Train the model
df.fit(X_train,Y_train)

# Print the model accuracy and feature importance

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score,classification_report

In [None]:
# Load the iris dataset:
iris = load_iris()

In [None]:
# Split the dataset into features (X) and target (y)
X = iris.data
y = iris.target

In [None]:
# Create a Decision Tree Classifier using the Gini criterion
clf = DecisionTreeClassifier(criterion='gini',random_state=42)

In [None]:
# Train the model
clf.fit(X,y)

In [None]:
# Make predictions on the test set
y_pred = clf.predict(X_test)

In [None]:
# Print feature importance
feature_importance = clf.feature_importances_
for i in range(len(iris.feature_names)):
  print(f"Feature: {iris.feature_names[i]},Importance: {feature_importance[i]:.3f}")

Feature: sepal length (cm),Importance: 0.013
Feature: sepal width (cm),Importance: 0.000
Feature: petal length (cm),Importance: 0.564
Feature: petal width (cm),Importance: 0.423


#7. Write a python program to:
.Load the iris datset.
.Train a Decision tree classifier with max_depth=3 and compare its accuracy to a fully grown tree.

# Load the iris dataset.

In [None]:
from sklearn.datasets import load_iris
import pandas as pd

In [None]:
# Load the iris dataset
iris = load_iris()

In [None]:
# Create a DataFrame
iris_df = pd.DataFrame(data=iris.data,columns=iris.feature_names)
iris_df['target'] = iris.target
iris_df['species'] = iris_df['target'].apply(lambda x: iris.target_names[x])
print(iris_df.head())

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   target species  
0       0  setosa  
1       0  setosa  
2       0  setosa  
3       0  setosa  
4       0  setosa  


# Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully grown tree.

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score,classification_report

In [None]:
# Load the iris dataset
iris = load_iris()

In [None]:
# Split the dataset into features (X) and target (y)
X = iris.data
y = iris.target

In [None]:
# Split the data into training and testing sets
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)

In [None]:
# Create a fully grown Decision Tree Classifier
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train,y_train)
full_tree_pred = full_tree.predict(X_test)
full_tree_accuracy = accuracy_score(y_test,full_tree_pred)

In [None]:
# Create a Decision Tree Classifierwith max_depth=3
pruned_tree = DecisionTreeClassifier(max_depth=3,random_state=42)
pruned_tree.fit(X_train,y_train)
pruned_tree_pred = pruned_tree.predict(X_test)
pruned_tree_accuracy = accuracy_score(y_test,pruned_tree_pred)

In [None]:
# Compare the accuracy of the two trees
print(f"Accuracy of the fully grown tree: {full_tree_accuracy:.2f}")
print(f"Accuracy of the pruned tree: {pruned_tree_accuracy:.2f}")

Accuracy of the fully grown tree: 1.00
Accuracy of the pruned tree: 1.00


In [None]:
if full_tree_accuracy > pruned_tree_accuracy:
  print("The fully grown tree performs better.")
else:
  print("The pruned tree performs better or is equal to the fully grown tree.")
  print("Both trees have the same performance.")

The pruned tree performs better or is equal to the fully grown tree.
Both trees have the same performance.


#8. Write a python program to:
.Load the boston housing dataset
.Train a Decision Tree Regressor
.Print the Mean Squared Error (MSE) and feature importances


# Load the boston housing dataset.

In [None]:
import pandas as pd

In [None]:
# Load the Boston Housing dataset
url = "http://lib.stat.cmu.edu/datasets/boston"
data = pd.read_csv(url,skiprows=22,header=None,sep='\s+',engine='python')
data.columns=['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO']

print(data.head())

        CRIM     ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  \
0    0.00632  18.00   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0   
1  396.90000   4.98  24.00   NaN    NaN    NaN   NaN     NaN  NaN    NaN   
2    0.02731   0.00   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0   
3  396.90000   9.14  21.60   NaN    NaN    NaN   NaN     NaN  NaN    NaN   
4    0.02729   0.00   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0   

   PTRATIO  
0     15.3  
1      NaN  
2     17.8  
3      NaN  
4     17.8  


# Train a decision tree regressor.

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
import numpy as np

In [None]:
# Generate a regression dataset
X,y = make_regression(n_samples=1000,n_features=10,random_state=42)

In [None]:
# Split the data into training and testing sets
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)

In [None]:
# Create a Decision Tree Regressor
dtr = DecisionTreeRegressor(random_state=42)

In [None]:
# Train the regressor
dtr.fit(X_train,y_train)

In [None]:
# Make predictions on the test set
y_pred = dtr.predict(X_test)

In [None]:
# Evaluate the regressor
mse = mean_squared_error(y_test,y_pred)
rmse = np.sqrt(mse)
print(f"Root Mean Squared Error (RMSE):{rmse:.2f}")

Root Mean Squared Error (RMSE):81.64


# Print the Mean Squared Error(MSE) and feature importance.

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
import numpy as np

In [None]:
# General a regression dataset
X,y = make_regression(n_samples=1000,n_features=10,random_state=42)

In [None]:
# Split the data into training and testing sets
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)

In [None]:
# Create a Decision Tree Regressor
dtr = DecisionTreeRegressor(random_state=42)

In [None]:
# Train the regressor
dtr.fit(X_train,y_train)

In [None]:
# Make predictions on the test set
y_pred = dtr.predict(X_test)

In [None]:
# Evaluate the regressor
mse = mean_squared_error(y_test,y_pred)
print(f"Mean Squared Error (MSE): {mse:.2f}")

Mean Squared Error (MSE): 6664.48


In [None]:
# Print feature importance
feature_importance = dtr.feature_importances_
for i in range(X.shape[1]):
  print(f"Feature {i+1}: {feature_importance[i]:.3f}")

Feature 1: 0.032
Feature 2: 0.300
Feature 3: 0.038
Feature 4: 0.016
Feature 5: 0.009
Feature 6: 0.207
Feature 7: 0.032
Feature 8: 0.011
Feature 9: 0.015
Feature 10: 0.340


#9. Write a python program to:
. Load the iris dataset
.Tune the Decision Tree's max_depth and min_samples_split using GridSearchCV
. Print the best parametrs and the resulting model accuracy


# Load the iris dataset

In [None]:
from sklearn.datasets import load_iris
import pandas as pd

In [None]:
# Load the iris dataset
iris = load_iris()

In [None]:
# Create a DataFrame
iris_df = pd.DataFrame(data=iris.data,columns=iris.feature_names)
iris_df['target']=iris.target
iris_df['species']=iris_df['target'].apply(lambda x: iris.target_names[x])

In [None]:
print(iris_df.head())

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   target species  
0       0  setosa  
1       0  setosa  
2       0  setosa  
3       0  setosa  
4       0  setosa  


# Tune the Decision Tree's max_depth and min_samples_split using GridSearchCV

In [None]:

from sklearn.tree import DecisionTreeClassifier # or DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

In [None]:
param_grid = {
    'max_depth' : [3,5,7,10, None], #Example depths, None means no limit
    'min_samples_split': [2,5,10,20] #Examples minimum samples for a split
}

In [None]:
dtree = DecisionTreeClassifier(random_state=42)

In [None]:
grid_search = GridSearchCV(estimator=dt_model,
                           param_grid=param_grid,
                           cv=5, # Number of cross-validation folds
                           scoring='accuracy', # or 'neg_mean_squared_error' for regression
                           n_jobs=-1, # Use all available CPU cores
                           verbose=1) # Print progress messages

In [None]:

print("Best Score:",grid_search.best_score_)
best_dtree = grid_search.best_estimator_

AttributeError: 'GridSearchCV' object has no attribute 'best_score_'

# Print the best parameters and the resulting model accuracy.

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

In [None]:
# Load the iris dataset
iris = load_iris()

In [None]:
# Split the dataset into features (X) and target (y)
X = iris.data
y = iris.target

In [None]:
# Split the data into training and testing sets
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [None]:
# Define the hyperparameter grid
param_grid = {
    'max_depth': [None, 3,5,7,10],
    'min_samples_split': [2,5,10,15,20]
}


In [None]:
# Create a Decision Tree Classifier
dt = DecisionTreeClassifier(random_state=42)

In [None]:
# Perform grid search
grid_search = GridSearchCV(estimator=dt,param_grid=param_grid,cv=5,scoring='accuracy',n_jobs=-1,verbose=1)
grid_search.fit(X_train,y_train)

Fitting 5 folds for each of 25 candidates, totalling 125 fits


In [None]:
# print the best parameters and the resulting model accuracy
print(f"Best Parameters: {grid_search.best_params_}")
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
print(f"Model Accuracy: {accuracy:.3f}")

Best Parameters: {'max_depth': None, 'min_samples_split': 2}
Model Accuracy: 1.000


#10. Imagine you're working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values.
     Explain the step-by-step process you would follow to:
     . Handle the missing values
     . Encode the categorical features
     . Train a Decision Tree Model
     . Tune its hyperparameters
     . Evaluate its performance
         And describe what business value this model could provide in the real-world setting.

#Step 1: Handle Missing Values
1. Identify missing values.
2. Determine the type of missing values.
3. Choose an imputation strategy:
(a) Mean/Median imputation for numerical features.
(b) Mode imputation for categorical features.
(c) K-Nearest Neighbors (KNN) imputation.
(d) Multiple imputation by Chained Equations(MICE).
4. Implement imputation: Use panda's fillna() function or scikit-learn's Imputer class to impute missing values.
#Step 2: Encode Categorical Features
1. Identify categorical features.
2. Choose an encoding strategy.
3. Implement encoding.
#Step 3: Train a Decision Tree Model
1. Split data.
2. Train a Decision Tree Model.
#Step 4: Tune Hyperparameters
1. Define hyperparameter grid.
2. Use GridSearchCV.
3. Identify best hyperparameters.
#Step 5: Evaluate Performance
1. Evaluate on test set.
2. Compare to baseline.

#Business Value:
1. Early disease detection: The model can help detect diseases at an early stage,allowing for timely interventions and improving patient outcomes.
2. Personalized medicine: The model can be used to develop personalized treatment plans based on individual patient characteristics.
3. Resource allocation: The model can help healthcare providers allocate resources more effectively by identifying high-risk patients and prioritizing their care.
4. Cost savings: The model can help reduce healthcare costs by reducing the number of unneccesary tests and procedures.

By developing an accurate and reliable disease prediction model,healthcare providers can improve patient outcomes,reduce costs,and enhance the overall quality of care.