# Decision Tree Regression with Python and Scikit-Learn

##### In this project, I build a Decision Tree Regressor to predict the expected salary (CTC) of a job applicant. I implement Decision Tree Regression with Python and Scikit-Learn, using the squared-error (MSE) criterion. I have used the Expected CTC dataset for this project, which can be found on my GitHub page.

### Table of Contents

##### 1. Introduction to Decision Tree algorithm
##### 2. Classification and Regression Trees
##### 3. Decision Tree algorithm intuition
##### 4. Attribute selection measure - Mean Squared Error (MSE)
##### 5. The problem statement
##### 6. Dataset description
##### 7. Import libraries
##### 8. Import dataset
##### 9. Exploratory data analysis
##### 10. Declare feature vector and target variable
##### 11. Split data into separate training and test set
##### 12. Feature engineering
##### 13. Decision Tree Regressor with criterion Mean Squared Error (MSE)
##### 14. Results and conclusion

### 1. Introduction to Decision Tree Algorithm

##### The Decision Tree algorithm is one of the most popular machine learning algorithms. It uses a tree-like structure of decisions and their possible outcomes to solve a particular problem. It belongs to the class of supervised learning algorithms and can be used for both classification and regression. A decision tree consists of a root node, branches, and leaf nodes. Each internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a prediction: a class label in classification, or a numeric value in regression. The topmost node in the tree is the root node.

### 2. Classification and Regression Trees

##### Nowadays, the Decision Tree algorithm is often known by the name CART, which stands for Classification and Regression Trees. The term was introduced by Leo Breiman to refer to Decision Tree algorithms that can be used for classification and regression modeling problems. The CART algorithm provides a foundation for other important algorithms such as bagged decision trees, random forests and boosted decision trees. In this project, I will solve a regression problem.

### 3. Decision Tree algorithm intuition

##### The Decision Tree algorithm is one of the most widely used supervised machine learning algorithms, suitable for both classification and regression tasks. The intuition behind it is simple:
##### 1. For each attribute in the dataset, the Decision Tree algorithm forms a node. The most important attribute is placed at the root node.
##### 2. To make a prediction, we start at the root node and work our way down the tree, at each node following the branch whose condition our data point satisfies.
##### 3. This process continues until a leaf node is reached. The leaf contains the prediction or outcome of the Decision Tree.
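
##### The root-to-leaf walk described above can be sketched in a few lines. This is a toy illustration, not scikit-learn's internal representation: the nested-dictionary layout, the `predict_one` helper, and all thresholds and leaf values are invented for the example.

```python
# A hand-written toy tree; thresholds and leaf values are invented for illustration.
toy_tree = {
    "feature": "Total_Experience",
    "threshold": 10,
    "left": {"value": 900_000.0},           # leaf: Total_Experience <= 10
    "right": {                               # internal node: Total_Experience > 10
        "feature": "Inhand_Offer",
        "threshold": 0,
        "left": {"value": 2_000_000.0},     # leaf: no offer in hand (0)
        "right": {"value": 2_600_000.0},    # leaf: offer in hand (1)
    },
}


def predict_one(node, row):
    """Walk from the root to a leaf and return the value stored there."""
    while "value" not in node:
        branch = "left" if row[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["value"]


print(predict_one(toy_tree, {"Total_Experience": 4, "Inhand_Offer": 1}))   # 900000.0
print(predict_one(toy_tree, {"Total_Experience": 15, "Inhand_Offer": 1}))  # 2600000.0
```

##### Each prediction costs only as many comparisons as the tree is deep, which is one reason decision trees are fast at prediction time.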

### 4. Attribute Selection Measure - Mean Squared Error (MSE)

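##### Mean Squared Error (MSE) is a common metric for evaluating a regression model. It is the average of the squared errors, that is, the average squared difference between the actual (true) values and the values predicted by the model. A Decision Tree Regressor also uses it as its splitting criterion: at each node it chooses the split that gives the lowest weighted MSE across the resulting child nodes, where each child predicts the mean target value of its samples.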
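
##### To make the role of MSE in choosing a split concrete, here is a minimal sketch in plain Python. The helpers `mse` and `best_split` and the toy numbers are assumptions made for illustration; scikit-learn's implementation differs in its details (for example, in how candidate thresholds are placed and in efficiency).

```python
def mse(values):
    """MSE obtained when the mean of `values` is predicted for every value."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(xs, ys):
    """Return (threshold, weighted_mse) for the best split of the form x <= threshold."""
    best = None
    # The largest x is skipped because it would leave the right side empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        score = (len(left) * mse(left) + len(right) * mse(right)) / len(ys)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


# Toy data (invented): years of experience and a salary figure
experience = [1, 2, 3, 10, 11, 12]
salary = [10.0, 12.0, 11.0, 30.0, 31.0, 29.0]

print(mse(salary))                     # MSE before any split
print(best_split(experience, salary))  # the split at experience <= 3 wins
```

##### Splitting at `experience <= 3` separates the low and high salaries, so the weighted MSE drops from about 90.9 to about 0.67.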

### 5. The problem statement

##### The problem is to predict the expected salary (Expected_CTC) of a potential employee. In this project, I build a Decision Tree Regressor for this purpose. I implement Decision Tree Regression with Python and Scikit-Learn. I have used the Expected CTC dataset for this project, which can be found on my GitHub page.

### 6. Dataset description

##### The dataset contains 25,000 records and 29 columns: 28 attributes describing each applicant, such as details of previous employment and academic qualifications, plus the target variable Expected_CTC.

### 7. Importing Libraries 

In [246]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
%matplotlib inline 

In [247]:
import warnings #Ignore warnings
warnings.filterwarnings('ignore')

### 8. Import Dataset

In [249]:
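data = r"D:\expected_ctc.csv"  # raw string, so the backslash is not treated as an escape character
df = pd.read_csv(data, header=None)  # read the data into a DataFrame; header=None keeps the header line as row 0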

### 9. Exploratory Data Analysis

In [251]:
df.shape #Check size of the dataset

(25001, 29)

In [252]:
df.head() #Preview data

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,19,20,21,22,23,24,25,26,27,28
0,IDX,Applicant_ID,Total_Experience,Total_Experience_in_field_applied,Department,Role,Industry,Organization,Designation,Education,...,Curent_Location,Preferred_location,Current_CTC,Inhand_Offer,Last_Appraisal_Rating,No_Of_Companies_worked,Number_of_Publications,Certifications,International_degree_any,Expected_CTC
1,1,22753,0,0,,,,,,PG,...,Guwahati,Pune,0,N,,0,0,0,0,384551
2,2,51087,23,14,HR,Consultant,Analytics,H,HR,Doctorate,...,Bangalore,Nagpur,2702664,Y,Key_Performer,2,4,0,0,3783729
3,3,38413,21,12,Top Management,Consultant,Training,J,,Doctorate,...,Ahmedabad,Jaipur,2236661,Y,Key_Performer,5,3,0,0,3131325
4,4,11501,15,8,Banking,Financial Analyst,Aviation,F,HR,Doctorate,...,Kanpur,Kolkata,2100510,N,C,5,3,0,0,2608833


In [253]:
# give meaningful names to the columns
col_names=['IDX','Applicant_ID','Total_Experience','Total_Experience_in_field_applied','Department','Role','Industry','Organization','Designation','Education','Graduation_Specialization','University_Grad','Passing_Year_Of_Graduation','PG_Specialization','University_PG','Passing_Year_Of_PG','PHD_Specialization','University_PHD','Passing_Year_Of_PHD','Curent_Location','Preferred_location','Current_CTC','Inhand_Offer','Last_Appraisal_Rating','No_Of_Companies_worked','Number_of_Publications','Certifications','International_degree_any','Expected_CTC'] #Rename column names 
df.columns=col_names
col_names

['IDX',
 'Applicant_ID',
 'Total_Experience',
 'Total_Experience_in_field_applied',
 'Department',
 'Role',
 'Industry',
 'Organization',
 'Designation',
 'Education',
 'Graduation_Specialization',
 'University_Grad',
 'Passing_Year_Of_Graduation',
 'PG_Specialization',
 'University_PG',
 'Passing_Year_Of_PG',
 'PHD_Specialization',
 'University_PHD',
 'Passing_Year_Of_PHD',
 'Curent_Location',
 'Preferred_location',
 'Current_CTC',
 'Inhand_Offer',
 'Last_Appraisal_Rating',
 'No_Of_Companies_worked',
 'Number_of_Publications',
 'Certifications',
 'International_degree_any',
 'Expected_CTC']

In [254]:
df.head()

Unnamed: 0,IDX,Applicant_ID,Total_Experience,Total_Experience_in_field_applied,Department,Role,Industry,Organization,Designation,Education,...,Curent_Location,Preferred_location,Current_CTC,Inhand_Offer,Last_Appraisal_Rating,No_Of_Companies_worked,Number_of_Publications,Certifications,International_degree_any,Expected_CTC
0,IDX,Applicant_ID,Total_Experience,Total_Experience_in_field_applied,Department,Role,Industry,Organization,Designation,Education,...,Curent_Location,Preferred_location,Current_CTC,Inhand_Offer,Last_Appraisal_Rating,No_Of_Companies_worked,Number_of_Publications,Certifications,International_degree_any,Expected_CTC
1,1,22753,0,0,,,,,,PG,...,Guwahati,Pune,0,N,,0,0,0,0,384551
2,2,51087,23,14,HR,Consultant,Analytics,H,HR,Doctorate,...,Bangalore,Nagpur,2702664,Y,Key_Performer,2,4,0,0,3783729
3,3,38413,21,12,Top Management,Consultant,Training,J,,Doctorate,...,Ahmedabad,Jaipur,2236661,Y,Key_Performer,5,3,0,0,3131325
4,4,11501,15,8,Banking,Financial Analyst,Aviation,F,HR,Doctorate,...,Kanpur,Kolkata,2100510,N,C,5,3,0,0,2608833


In [255]:
df = df.drop(index=0).reset_index(drop=True)  # drop row 0, which holds the header line that was read in as data
df.head()

Unnamed: 0,IDX,Applicant_ID,Total_Experience,Total_Experience_in_field_applied,Department,Role,Industry,Organization,Designation,Education,...,Curent_Location,Preferred_location,Current_CTC,Inhand_Offer,Last_Appraisal_Rating,No_Of_Companies_worked,Number_of_Publications,Certifications,International_degree_any,Expected_CTC
0,1,22753,0,0,,,,,,PG,...,Guwahati,Pune,0,N,,0,0,0,0,384551
1,2,51087,23,14,HR,Consultant,Analytics,H,HR,Doctorate,...,Bangalore,Nagpur,2702664,Y,Key_Performer,2,4,0,0,3783729
2,3,38413,21,12,Top Management,Consultant,Training,J,,Doctorate,...,Ahmedabad,Jaipur,2236661,Y,Key_Performer,5,3,0,0,3131325
3,4,11501,15,8,Banking,Financial Analyst,Aviation,F,HR,Doctorate,...,Kanpur,Kolkata,2100510,N,C,5,3,0,0,2608833
4,5,58941,10,5,Sales,Project Manager,Insurance,E,Medical Officer,Grad,...,Ahmedabad,Ahmedabad,1931644,N,C,2,3,0,0,2221390
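
##### The extra header row appears because the file was read with header=None. The sketch below uses a tiny in-memory CSV as a stand-in for the real file (the three-column subset and the name `csv_text` are chosen for illustration) and contrasts this with the default header handling of `pd.read_csv`, which takes the column names from the first line and infers numeric types. It is shown for comparison only; the rest of this notebook continues with the header=None version.

```python
import io

import pandas as pd

# A tiny stand-in for the real file (illustrative three-column subset)
csv_text = "IDX,Total_Experience,Expected_CTC\n1,0,384551\n2,23,3783729\n"

# header=None: the header line becomes data row 0 and every column is read as text
raw = pd.read_csv(io.StringIO(csv_text), header=None)
print(raw.shape)   # (3, 3)
print(raw.dtypes)

# Default header handling: the first line supplies the column names
named = pd.read_csv(io.StringIO(csv_text))
print(named.shape)  # (2, 3)
print(named.dtypes)
```

##### With the default setting, the renaming step and the removal of row 0 would not be needed, and the numeric columns would not have to be converted by hand.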


In [256]:
df.info() #gather information about all attributes

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 29 columns):
 #   Column                             Non-Null Count  Dtype 
---  ------                             --------------  ----- 
 0   IDX                                25000 non-null  object
 1   Applicant_ID                       25000 non-null  object
 2   Total_Experience                   25000 non-null  object
 3   Total_Experience_in_field_applied  25000 non-null  object
 4   Department                         22222 non-null  object
 5   Role                               24037 non-null  object
 6   Industry                           24092 non-null  object
 7   Organization                       24092 non-null  object
 8   Designation                        21871 non-null  object
 9   Education                          25000 non-null  object
 10  Graduation_Specialization          18820 non-null  object
 11  University_Grad                    18820 non-null  object
 12  Pass

In [257]:
# Print the frequency of each distinct value, column by column
for col in df.columns:
    print(df[col].value_counts())

IDX
1        1
16651    1
16673    1
16672    1
16671    1
        ..
8332     1
8331     1
8330     1
8329     1
25000    1
Name: count, Length: 25000, dtype: int64
Applicant_ID
34487    5
19038    5
30513    5
24678    5
22753    4
        ..
38495    1
49168    1
49747    1
18385    1
15777    1
Name: count, Length: 19766, dtype: int64
Total_Experience
8     1017
21    1004
5     1000
22     996
16     991
19     991
15     988
2      985
7      984
6      975
9      967
1      963
3      962
12     958
17     953
11     953
10     953
13     950
14     944
23     940
24     940
25     936
18     933
20     909
0      908
4      900
Name: count, dtype: int64
Total_Experience_in_field_applied
0     3676
1     2781
2     2217
3     1932
4     1712
5     1533
6     1403
7     1232
8     1089
9      954
10     867
11     776
12     711
14     624
13     610
15     539
16     498
17     381
18     329
19     317
20     245
21     200
22     146
23     109
24      82
25      37
Name: coun

In [258]:
df['Expected_CTC'].value_counts() # checking count of Target variable

Expected_CTC
598151     3
511463     2
3008296    2
858605     2
2553115    2
          ..
682638     1
832676     1
1974720    1
1569358    1
1216666    1
Name: count, Length: 24913, dtype: int64

In [259]:
df.isnull().sum() #check for null values 

IDX                                      0
Applicant_ID                             0
Total_Experience                         0
Total_Experience_in_field_applied        0
Department                            2778
Role                                   963
Industry                               908
Organization                           908
Designation                           3129
Education                                0
Graduation_Specialization             6180
University_Grad                       6180
Passing_Year_Of_Graduation            6180
PG_Specialization                     7692
University_PG                         7692
Passing_Year_Of_PG                    7692
PHD_Specialization                   11881
University_PHD                       11881
Passing_Year_Of_PHD                  11881
Curent_Location                          0
Preferred_location                       0
Current_CTC                              0
Inhand_Offer                             0
Last_Apprai

In [260]:
# First, missing values in the graduation, post-graduation and PhD columns are treated as unknown,
# so the relevant attributes are filled with the placeholder 'Unknown'.
education_cols = [
    'Graduation_Specialization', 'University_Grad', 'Passing_Year_Of_Graduation',
    'PG_Specialization', 'University_PG', 'Passing_Year_Of_PG',
    'PHD_Specialization', 'University_PHD', 'Passing_Year_Of_PHD',
]

# Replace missing (NaN) values with 'Unknown' in each of these columns
df[education_cols] = df[education_cols].fillna('Unknown')

df.isnull().sum()  # check for null values again

IDX                                     0
Applicant_ID                            0
Total_Experience                        0
Total_Experience_in_field_applied       0
Department                           2778
Role                                  963
Industry                              908
Organization                          908
Designation                          3129
Education                               0
Graduation_Specialization               0
University_Grad                         0
Passing_Year_Of_Graduation              0
PG_Specialization                       0
University_PG                           0
Passing_Year_Of_PG                      0
PHD_Specialization                      0
University_PHD                          0
Passing_Year_Of_PHD                     0
Curent_Location                         0
Preferred_location                      0
Current_CTC                             0
Inhand_Offer                            0
Last_Appraisal_Rating             

In [261]:
# The Organization column contains garbage values that do not help the prediction, so we drop it.
# IDX and Applicant_ID are identifiers, not determining factors for the expected CTC, so we drop them too.
df.drop(columns=['Organization', 'IDX','Applicant_ID'], inplace=True)
df.head()

Unnamed: 0,Total_Experience,Total_Experience_in_field_applied,Department,Role,Industry,Designation,Education,Graduation_Specialization,University_Grad,Passing_Year_Of_Graduation,...,Curent_Location,Preferred_location,Current_CTC,Inhand_Offer,Last_Appraisal_Rating,No_Of_Companies_worked,Number_of_Publications,Certifications,International_degree_any,Expected_CTC
0,0,0,,,,,PG,Arts,Lucknow,2020,...,Guwahati,Pune,0,N,,0,0,0,0,384551
1,23,14,HR,Consultant,Analytics,HR,Doctorate,Chemistry,Surat,1988,...,Bangalore,Nagpur,2702664,Y,Key_Performer,2,4,0,0,3783729
2,21,12,Top Management,Consultant,Training,,Doctorate,Zoology,Jaipur,1990,...,Ahmedabad,Jaipur,2236661,Y,Key_Performer,5,3,0,0,3131325
3,15,8,Banking,Financial Analyst,Aviation,HR,Doctorate,Others,Bangalore,1997,...,Kanpur,Kolkata,2100510,N,C,5,3,0,0,2608833
4,10,5,Sales,Project Manager,Insurance,Medical Officer,Grad,Zoology,Mumbai,2004,...,Ahmedabad,Ahmedabad,1931644,N,C,2,3,0,0,2221390


In [262]:
# Remove rows in which all four of the following columns, which have many missing values, are empty at the same time
cols = ['Department', 'Role', 'Industry', 'Designation']

# Drop rows where all specified columns are NaN or all are 'NA'
df = df[~(
    df[cols].isnull().all(axis=1) | 
    (df[cols].apply(lambda row: all(str(val).strip().upper() == 'NA' for val in row), axis=1))
)]
df.head()

Unnamed: 0,Total_Experience,Total_Experience_in_field_applied,Department,Role,Industry,Designation,Education,Graduation_Specialization,University_Grad,Passing_Year_Of_Graduation,...,Curent_Location,Preferred_location,Current_CTC,Inhand_Offer,Last_Appraisal_Rating,No_Of_Companies_worked,Number_of_Publications,Certifications,International_degree_any,Expected_CTC
1,23,14,HR,Consultant,Analytics,HR,Doctorate,Chemistry,Surat,1988,...,Bangalore,Nagpur,2702664,Y,Key_Performer,2,4,0,0,3783729
2,21,12,Top Management,Consultant,Training,,Doctorate,Zoology,Jaipur,1990,...,Ahmedabad,Jaipur,2236661,Y,Key_Performer,5,3,0,0,3131325
3,15,8,Banking,Financial Analyst,Aviation,HR,Doctorate,Others,Bangalore,1997,...,Kanpur,Kolkata,2100510,N,C,5,3,0,0,2608833
4,10,5,Sales,Project Manager,Insurance,Medical Officer,Grad,Zoology,Mumbai,2004,...,Ahmedabad,Ahmedabad,1931644,N,C,2,3,0,0,2221390
5,16,3,Top Management,Area Sales Manager,Retail,Director,Doctorate,Others,Bangalore,1998,...,Pune,Bhubaneswar,3511167,Y,C,5,4,0,0,4522383


In [263]:
# Fill the remaining missing values in these four columns with 'Unknown', then check for null values again
cols_to_fill = ['Department', 'Role', 'Industry', 'Designation']
df[cols_to_fill] = df[cols_to_fill].fillna('Unknown')
df.isnull().sum() #check for null values 

Total_Experience                     0
Total_Experience_in_field_applied    0
Department                           0
Role                                 0
Industry                             0
Designation                          0
Education                            0
Graduation_Specialization            0
University_Grad                      0
Passing_Year_Of_Graduation           0
PG_Specialization                    0
University_PG                        0
Passing_Year_Of_PG                   0
PHD_Specialization                   0
University_PHD                       0
Passing_Year_Of_PHD                  0
Curent_Location                      0
Preferred_location                   0
Current_CTC                          0
Inhand_Offer                         0
Last_Appraisal_Rating                0
No_Of_Companies_worked               0
Number_of_Publications               0
Certifications                       0
International_degree_any             0
Expected_CTC             

In [264]:
# Next, we drop rows that have outliers in the numerical attributes Total_Experience,
# Total_Experience_in_field_applied and Current_CTC to clean the data

# First, convert these variables to a numeric data type
df['Total_Experience'] = df['Total_Experience'].astype('int64')
df['Total_Experience_in_field_applied'] = df['Total_Experience_in_field_applied'].astype('int64')
df['Current_CTC'] = df['Current_CTC'].astype('int64')

In [265]:
# Function to remove outliers for all numeric columns
def remove_outliers(df):
    # Iterate through each numeric column
    for col in df.select_dtypes(include=['float64', 'int64']):
        Q1 = df[col].quantile(0.25)  # First quartile
        Q3 = df[col].quantile(0.75)  # Third quartile
        IQR = Q3 - Q1  # Interquartile range
        
        # Define outlier bounds
        lower_limit = Q1 - 1.5 * IQR
        upper_limit = Q3 + 1.5 * IQR
        
        # Filter out rows where the column value is outside the outlier bounds
        df = df[(df[col] >= lower_limit) & (df[col] <= upper_limit)]
        
    return df

# Remove outliers from the entire dataset
df = remove_outliers(df)

# Display the cleaned DataFrame
print(df)


       Total_Experience  Total_Experience_in_field_applied      Department  \
1                    23                                 14              HR   
2                    21                                 12  Top Management   
3                    15                                  8         Banking   
4                    10                                  5           Sales   
5                    16                                  3  Top Management   
...                 ...                                ...             ...   
24995                18                                 13     Engineering   
24996                12                                  8              HR   
24997                22                                  8         Banking   
24998                25                                  8       Marketing   
24999                 8                                  0         Banking   

                     Role    Industry         Designation   Edu
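
##### To see what the 1.5 × IQR rule does on a single column, here is a small worked example on invented numbers. Note that `remove_outliers` above filters the DataFrame one column at a time, so the quartiles of each later column are computed on the rows that survived the earlier columns.

```python
import pandas as pd

# Toy values (invented); 100 is the obvious outlier
s = pd.Series([10, 12, 13, 14, 15, 16, 18, 100])

q1, q3 = s.quantile(0.25), s.quantile(0.75)  # 12.75 and 16.5
iqr = q3 - q1                                 # 3.75
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

kept = s[(s >= lower) & (s <= upper)]
print(lower, upper)   # 7.125 22.125
print(kept.tolist())  # [10, 12, 13, 14, 15, 16, 18]
```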

### 10. Declare feature vector and target variable

In [267]:
# Now that the data has been cleaned and structured, the target variable is separated from the feature columns
y = df['Expected_CTC']
x = df.drop(['Expected_CTC'],axis=1)

### 11. Split data into separate training and test set

In [269]:
# Separate the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.33, random_state = 43)

### 12. Feature Engineering  

In [271]:
#check the shape of the X_train and X_test
X_train.shape, X_test.shape

((15988, 25), (7876, 25))

In [272]:
#check data types in X_train
X_train.dtypes

Total_Experience                      int64
Total_Experience_in_field_applied     int64
Department                           object
Role                                 object
Industry                             object
Designation                          object
Education                            object
Graduation_Specialization            object
University_Grad                      object
Passing_Year_Of_Graduation           object
PG_Specialization                    object
University_PG                        object
Passing_Year_Of_PG                   object
PHD_Specialization                   object
University_PHD                       object
Passing_Year_Of_PHD                  object
Curent_Location                      object
Preferred_location                   object
Current_CTC                           int64
Inhand_Offer                         object
Last_Appraisal_Rating                object
No_Of_Companies_worked               object
Number_of_Publications          

In [273]:
X_train.head()

Unnamed: 0,Total_Experience,Total_Experience_in_field_applied,Department,Role,Industry,Designation,Education,Graduation_Specialization,University_Grad,Passing_Year_Of_Graduation,...,Passing_Year_Of_PHD,Curent_Location,Preferred_location,Current_CTC,Inhand_Offer,Last_Appraisal_Rating,No_Of_Companies_worked,Number_of_Publications,Certifications,International_degree_any
11290,11,8,Sales,Consultant,Aviation,Product Manager,PG,Zoology,Kolkata,2009,...,Unknown,Chennai,Chennai,2324766,N,Key_Performer,4,1,0,0
24319,3,0,Education,Others,BFSI,Web Designer,Under Grad,Unknown,Unknown,Unknown,...,Unknown,Bangalore,Surat,653206,Y,B,3,8,0,0
6973,16,5,Healthcare,Researcher,FMCG,Marketing Manager,Grad,Psychology,Pune,1998,...,2004,Guwahati,Guwahati,2556753,N,D,5,6,0,0
7184,15,4,Top Management,Consultant,BFSI,Assistant Manager,Grad,Zoology,Lucknow,1996,...,2005,Nagpur,Bhubaneswar,1553405,N,B,5,5,2,0
14154,4,4,Accounts,Financial Analyst,FMCG,HR,Under Grad,Unknown,Unknown,Unknown,...,Unknown,Jaipur,Mumbai,585267,Y,C,4,2,3,0


##### Here, we use different encoding techniques for different attributes:
##### A hashing encoder is used for attributes with high cardinality.
##### Label encoding is used for binary data.
##### One-hot encoding is used for the other categorical data.
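
##### Before applying them to the real data, the three ideas can be illustrated on a three-row toy frame. This sketch uses only pandas and hashlib as stand-ins; it is not how category_encoders implements them. The `hash_bucket` helper, the choice of MD5, the four buckets and the use of `University_Grad` as the hashed column are all assumptions made for the example.

```python
import hashlib

import pandas as pd

toy = pd.DataFrame({
    "Inhand_Offer": ["N", "Y", "Y"],
    "Department": ["HR", "Sales", "HR"],
    "University_Grad": ["Surat", "Jaipur", "Mumbai"],
})

# Binary column -> a single 0/1 column (what label encoding amounts to here)
offer = toy["Inhand_Offer"].map({"N": 0, "Y": 1})

# One-hot encoding -> one indicator column per category
dept = pd.get_dummies(toy["Department"], prefix="Department", dtype=int)


# Hashing trick -> a fixed number of columns, however many categories exist
def hash_bucket(value, n_buckets=4):
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


buckets = toy["University_Grad"].map(hash_bucket)
hashed = pd.get_dummies(buckets, dtype=int).reindex(columns=range(4), fill_value=0)
hashed.columns = [f"col_{i}" for i in range(4)]

print(pd.concat([offer, dept, hashed], axis=1))
```

##### One-hot encoding adds a column for every category, while hashing keeps the width fixed at the cost of possible collisions, where different categories share a bucket.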

In [275]:
import category_encoders as ce
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Copy the original data before transforming
X_train_enc = X_train.copy()
X_test_enc = X_test.copy()

### 1. Hashing Encoder for high-cardinality feature: 'Current_CTC'
hash_enc = ce.HashingEncoder(cols=['Current_CTC'])
X_train_name = hash_enc.fit_transform(X_train_enc[['Current_CTC']])
X_test_name = hash_enc.transform(X_test_enc[['Current_CTC']])
X_train_enc = X_train_enc.drop('Current_CTC', axis=1)
X_test_enc = X_test_enc.drop('Current_CTC', axis=1)
X_train_enc = pd.concat([X_train_enc.reset_index(drop=True), X_train_name.reset_index(drop=True)], axis=1)
X_test_enc = pd.concat([X_test_enc.reset_index(drop=True), X_test_name.reset_index(drop=True)], axis=1)

### 2. Label Encoding for binary feature: 'Inhand_Offer'
le = LabelEncoder()
X_train_enc['Inhand_Offer'] = le.fit_transform(X_train['Inhand_Offer'])
X_test_enc['Inhand_Offer'] = le.transform(X_test['Inhand_Offer'])

### 3. OneHot Encoding for the remaining categorical features
ohe_cols = ['Department', 'Role', 'Industry', 'Designation', 'Education',
            'Graduation_Specialization', 'University_Grad', 'Passing_Year_Of_Graduation',
            'PG_Specialization', 'University_PG', 'Passing_Year_Of_PG',
            'PHD_Specialization', 'University_PHD', 'Passing_Year_Of_PHD',
            'Curent_Location', 'Preferred_location', 'Last_Appraisal_Rating']

# Only transform these columns, then join back
ohe = ce.OneHotEncoder(cols=ohe_cols, use_cat_names=True)
X_train_ohe = ohe.fit_transform(X_train[ohe_cols])
X_test_ohe = ohe.transform(X_test[ohe_cols])

# Drop the original categorical columns
X_train_enc = X_train_enc.drop(ohe_cols, axis=1)
X_test_enc = X_test_enc.drop(ohe_cols, axis=1)

# Add the new encoded columns
X_train_enc = pd.concat([X_train_enc.reset_index(drop=True), X_train_ohe.reset_index(drop=True)], axis=1)
X_test_enc = pd.concat([X_test_enc.reset_index(drop=True), X_test_ohe.reset_index(drop=True)], axis=1)

In [276]:
X_train_enc.head()

Unnamed: 0,Total_Experience,Total_Experience_in_field_applied,Inhand_Offer,No_Of_Companies_worked,Number_of_Publications,Certifications,International_degree_any,col_0,col_1,col_2,...,Preferred_location_Bangalore,Preferred_location_Kanpur,Preferred_location_Mangalore,Preferred_location_Lucknow,Preferred_location_Jaipur,Last_Appraisal_Rating_Key_Performer,Last_Appraisal_Rating_B,Last_Appraisal_Rating_D,Last_Appraisal_Rating_C,Last_Appraisal_Rating_A
0,11,8,0,4,1,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
1,3,0,1,3,8,0,0,0,0,1,...,0,0,0,0,0,0,1,0,0,0
2,16,5,0,5,6,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,15,4,0,5,5,2,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,4,4,1,4,2,3,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [277]:
X_test_enc.head()

Unnamed: 0,Total_Experience,Total_Experience_in_field_applied,Inhand_Offer,No_Of_Companies_worked,Number_of_Publications,Certifications,International_degree_any,col_0,col_1,col_2,...,Preferred_location_Bangalore,Preferred_location_Kanpur,Preferred_location_Mangalore,Preferred_location_Lucknow,Preferred_location_Jaipur,Last_Appraisal_Rating_Key_Performer,Last_Appraisal_Rating_B,Last_Appraisal_Rating_D,Last_Appraisal_Rating_C,Last_Appraisal_Rating_A
0,23,3,0,3,0,0,0,1,0,0,...,0,0,0,0,0,0,1,0,0,0
1,15,15,0,3,1,2,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
2,4,2,0,4,7,1,1,0,0,0,...,0,1,0,0,0,0,0,0,1,0
3,10,0,1,5,7,2,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
4,19,2,1,6,6,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0


### 13. Decision Tree Regressor with criterion Mean Squared Error (MSE)

In [279]:
#import decision tree regressor
from sklearn.tree import DecisionTreeRegressor

In [280]:
# instantiate the DecisionTreeRegressor model with the squared error criterion
clf_mse = DecisionTreeRegressor(criterion = 'squared_error')

#fit the model 
clf_mse.fit(X_train_enc, y_train)

In [281]:
y_pred_mse = clf_mse.predict(X_test_enc)

In [282]:
from sklearn.metrics import r2_score

print("Model R² score with criterion squared_error: {0:0.4f}".format(r2_score(y_test, y_pred_mse)))

Model R² score with criterion squared_error: 0.7866
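
##### For reference, the R² score reported here is the coefficient of determination, 1 - SS_res / SS_tot. The hand-written `r2` function below is a sketch of that definition on invented numbers; in the notebook itself the score comes from `sklearn.metrics.r2_score`.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean) ** 2 for t in y_true)               # total variation
    return 1 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]  # invented values

print(r2(y_true, y_true))                # perfect predictions -> 1.0
print(r2(y_true, [6.0, 6.0, 6.0, 6.0]))  # always predicting the mean -> 0.0
print(r2(y_true, [4.0, 5.0, 7.0, 8.0]))  # small errors -> 0.9
```

##### A score of 1 means the predictions match the true values exactly, and a score of 0 means the model does no better than always predicting the mean.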


In [283]:
y_pred_train_mse = clf_mse.predict(X_train_enc)

y_pred_train_mse

array([3022195.,  849167., 2940265., ..., 2349093.,  574047., 3994162.])

In [284]:
print("Training set R² score: {0:0.4f}".format(r2_score(y_train, y_pred_train_mse)))

Training set R² score: 1.0000


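#### The training R² of 1.0000 against a test R² of 0.7866 shows that the model is overfitting, so we restrict the growth of the tree (pre-pruning) by limiting its depth and the minimum number of samples per split and per leaf.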

In [285]:
from sklearn.tree import DecisionTreeRegressor

clf_mse = DecisionTreeRegressor(
    criterion='squared_error',
    max_depth=5,             # Limit depth of the tree
    min_samples_split=10,    # Minimum samples to split a node
    min_samples_leaf=5,      # Minimum samples at a leaf node
    random_state=42
)

clf_mse.fit(X_train_enc, y_train)

# Evaluate again
y_pred_train = clf_mse.predict(X_train_enc)
y_pred_test = clf_mse.predict(X_test_enc)

print("Training R² Score: {:.4f}".format(r2_score(y_train, y_pred_train)))
print("Testing R² Score : {:.4f}".format(r2_score(y_test, y_pred_test)))


Training R² Score: 0.8629
Testing R² Score : 0.8614
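
##### The effect of a depth limit can be reproduced on synthetic data. The sketch below does not use the Expected CTC dataset: the noisy linear data, the sample size and the `results` dictionary are assumptions made for illustration. An unrestricted tree memorizes the training rows, while a shallow tree cannot.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (not the Expected CTC dataset): a noisy linear relationship
rng = np.random.default_rng(0)
X = rng.uniform(0, 25, size=(500, 1))
y = 100_000 * X[:, 0] + rng.normal(0, 300_000, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=43)

results = {}
for depth in (None, 3):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    train_r2 = r2_score(y_tr, tree.predict(X_tr))
    test_r2 = r2_score(y_te, tree.predict(X_te))
    results[depth] = (train_r2, test_r2)
    print(f"max_depth={depth}: train R² = {train_r2:.4f}, test R² = {test_r2:.4f}")
```

##### The unrestricted tree reaches a training R² of 1 because it can isolate every training row in its own leaf; the gap between its training and test scores is the overfitting that the depth limit is meant to reduce.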


### 14. Results and Conclusion

##### In this project, I build a Decision Tree Regressor model to predict the expected CTC of a job applicant.
##### The unrestricted tree reaches an R² score of 1.0000 on the training set but only 0.7866 on the test set, which is a clear sign of overfitting. This was handled by limiting the depth of the tree and the minimum number of samples per split and per leaf.
##### With these limits, the training-set R² score is 0.8629 and the test-set R² score is 0.8614. The two scores are almost the same, which suggests that the restricted tree generalizes to unseen data rather than memorizing the training set.
##### Overall, the model explains about 86% of the variance in the expected CTC on the test set.