
### Context

What is Engineering?

Engineering is the use of scientific principles to design and build machines, structures, and other items, including bridges, tunnels, roads, vehicles, and buildings. The discipline of engineering encompasses a broad range of more specialized fields of engineering, each with a more specific emphasis on particular areas of applied mathematics, applied science, and types of application.
![image.png](attachment:image.png)
 

Engineering is a broad discipline that is often broken down into several sub-disciplines. Although an engineer will usually be trained in a specific discipline, he or she may become multi-disciplined through experience. Engineering is often characterized as having four main branches: chemical engineering, civil engineering, electrical engineering, and mechanical engineering. [Reference: Wikipedia]

Engineering Graduates in India

India has a total of 6,214 engineering and technology institutions, in which around 2.9 million students are enrolled. On average, 1.5 million students receive an engineering degree every year, but because many lack the skills required for technical jobs, fewer than 20 percent find employment in their core domain. [Source: BWEDUCATION]

#### Objective
A relevant question is what determines the salary and the jobs these engineers are offered right after graduation. Factors such as college grades, candidate skills, the proximity of the college to industrial hubs, the candidate's specialization, and market conditions for specific industries all play a part. Based on these factors, your objective is to predict the salary of an engineering graduate in India.



#### Evaluation Criteria
Submissions are evaluated using Root-Mean-Squared-Error (RMSE).
![image.png](attachment:image.png)
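
For reference, RMSE is the square root of the mean squared difference between the predicted and actual salaries. A minimal NumPy sketch follows; the function name `rmse` and the sample numbers are illustrative and not part of the challenge material.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Illustrative salaries in INR (made-up numbers)
actual = [300000, 450000, 200000]
predicted = [320000, 400000, 230000]
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a few badly mispredicted salaries affect the score more than many small misses.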



### About the dataset
The dataset contains 33 attributes. The target variable refers to the salary of an Engineering Graduate in India. 

To load the training data in your Jupyter notebook, use the following command:

import pandas as pd

eng_grad_data  = pd.read_csv("https://raw.githubusercontent.com/dphi-official/Datasets/master/eng_grad_emp_salary/training_set_label.csv" )

#### Data Description
- ID: A unique ID to identify a candidate
- Salary: Annual CTC offered to the candidate (in INR)
- Gender: Candidate's gender
- DOB: Date of birth of the candidate
- 10percentage: Overall marks obtained in grade 10 examinations
- 10board: The school board whose curriculum the candidate followed in grade 10
- 12graduation: Year of graduation - senior year high school
- 12percentage: Overall marks obtained in grade 12 examinations
- 12board: The school board whose curriculum the candidate followed in grade 12
- CollegeID: Unique ID identifying the university/college the candidate attended for their undergraduate degree
- CollegeTier: Each college has been annotated as 1 or 2. The annotations have been computed from the average AMCAT scores obtained by the students in the college/university. Colleges with an average score above a threshold are tagged as 1 and others as 2.
- Degree: Degree obtained/pursued by the candidate
- Specialization: Specialization pursued by the candidate
- collegeGPA: Aggregate GPA at graduation
- CollegeCityID: A unique ID identifying the city in which the college is located
- CollegeCityTier: The tier of the city in which the college is located, annotated based on the city's population
- CollegeState: Name of the state in which the college is located
- GraduationYear: Year of graduation (Bachelor's degree)
- English: Score in AMCAT's English section
- Logical: Score in AMCAT's Logical ability section
- Quant: Score in AMCAT's Quantitative ability section
- Domain: Score in AMCAT's domain module
- ComputerProgramming: Score in AMCAT's Computer programming section
- ElectronicsAndSemicon: Score in AMCAT's Electronics & Semiconductor Engineering section
- ComputerScience: Score in AMCAT's Computer Science section
- MechanicalEngg: Score in AMCAT's Mechanical Engineering section
- ElectricalEngg: Score in AMCAT's Electrical Engineering section
- TelecomEngg: Score in AMCAT's Telecommunication Engineering section
- CivilEngg: Score in AMCAT's Civil Engineering section
- conscientiousness: Scores in one of the sections of AMCAT's personality test
- agreeableness: Scores in one of the sections of AMCAT's personality test
- extraversion: Scores in one of the sections of AMCAT's personality test
- nueroticism: Scores in one of the sections of AMCAT's personality test
- openess_to_experience: Scores in one of the sections of AMCAT's personality test

Note: For context, AMCAT is a job portal.

Test Dataset
Load the test data (name it `test_data`) using the following command:

test_data = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/eng_grad_emp_salary/testing_set_label.csv')

The target column is deliberately absent from this file because it is what you need to predict.



In [1]:
#libraries to use
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import minmax_scale
import sweetviz as sv
%matplotlib inline

In [2]:
pd.set_option('display.max_columns', None)
eng_grad_train_data  = pd.read_csv("https://raw.githubusercontent.com/dphi-official/Datasets/master/eng_grad_emp_salary/training_set_label.csv" )
holdout_data = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/eng_grad_emp_salary/testing_set_label.csv')

In [3]:
eng_grad_train_data.head()

Unnamed: 0,ID,Gender,DOB,10percentage,10board,12graduation,12percentage,12board,CollegeID,CollegeTier,Degree,Specialization,collegeGPA,CollegeCityID,CollegeCityTier,CollegeState,GraduationYear,English,Logical,Quant,Domain,ComputerProgramming,ElectronicsAndSemicon,ComputerScience,MechanicalEngg,ElectricalEngg,TelecomEngg,CivilEngg,conscientiousness,agreeableness,extraversion,nueroticism,openess_to_experience,Salary
0,604399,f,1990-10-22,87.8,cbse,2009,84.0,cbse,6920,1,B.Tech/B.E.,instrumentation and control engineering,73.82,6920,1,Delhi,2013,650,665,810,0.694479,485,366,-1,-1,-1,-1,-1,-0.159,0.3789,1.2396,0.1459,0.2889,445000
1,988334,m,1990-05-15,57.0,cbse,2010,64.5,cbse,6624,2,B.Tech/B.E.,computer science & engineering,65.0,6624,0,Uttar Pradesh,2014,440,435,210,0.342315,365,-1,-1,-1,-1,-1,-1,1.1336,0.0459,1.2396,0.5262,-0.2859,110000
2,301647,m,1989-08-21,77.33,"maharashtra state board,pune",2007,85.17,amravati divisional board,9084,2,B.Tech/B.E.,electronics & telecommunications,61.94,9084,0,Maharashtra,2011,485,475,505,0.824666,-1,400,-1,-1,-1,260,-1,0.51,-0.1232,1.5428,-0.2902,-0.2875,255000
3,582313,m,1991-05-04,84.3,cbse,2009,86.0,cbse,8195,1,B.Tech/B.E.,computer science & engineering,80.4,8195,1,Delhi,2013,675,620,635,0.990009,655,-1,-1,-1,-1,-1,-1,-0.4463,0.2124,0.3174,0.2727,0.4805,420000
4,339001,f,1990-10-30,82.0,cbse,2008,75.0,cbse,4889,2,B.Tech/B.E.,biotechnology,64.3,4889,1,Tamil Nadu,2012,575,495,365,0.278457,315,-1,-1,-1,-1,-1,-1,-1.4992,-0.7473,-1.0697,0.06223,0.1864,200000


In [4]:
#to visualize the data
my_report = sv.analyze(holdout_data)
my_report.show_html()


Report SWEETVIZ_REPORT.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.


In [5]:
# A notebook cell displays only its last expression, so the output below is holdout_data.shape
eng_grad_train_data.shape
holdout_data.shape

(1000, 33)

In [6]:
eng_grad_train_data['birth_year']=eng_grad_train_data['DOB'].str.split('-').str[0]
holdout_data['birth_year']=holdout_data['DOB'].str.split('-').str[0]
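
As an alternative sketch, assuming `DOB` is always an ISO-style date string like the rows shown earlier, parsing with `pd.to_datetime` yields an integer year and turns malformed dates into missing values instead of arbitrary strings. The small frame below is made up for illustration.

```python
import pandas as pd

demo = pd.DataFrame({"DOB": ["1990-10-22", "1989-08-21", "not a date"]})

# errors="coerce" converts unparseable entries to NaT instead of raising
dob = pd.to_datetime(demo["DOB"], errors="coerce")

# Nullable integer dtype keeps the year as an integer even with missing values
demo["birth_year"] = dob.dt.year.astype("Int64")

print(demo)
```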

In [7]:
eng_grad_train_data['birth_year']=eng_grad_train_data['birth_year'].astype('category')
holdout_data['birth_year']=holdout_data['birth_year'].astype('category')
eng_grad_train_data['CollegeTier']=eng_grad_train_data['CollegeTier'].astype('category')
holdout_data['CollegeTier']=holdout_data['CollegeTier'].astype('category')
eng_grad_train_data['GraduationYear']=eng_grad_train_data['GraduationYear'].astype('category')
holdout_data['GraduationYear']=holdout_data['GraduationYear'].astype('category')

In [8]:
holdout_data.describe(include='all')

Unnamed: 0,ID,Gender,DOB,10percentage,10board,12graduation,12percentage,12board,CollegeID,CollegeTier,Degree,Specialization,collegeGPA,CollegeCityID,CollegeCityTier,CollegeState,GraduationYear,English,Logical,Quant,Domain,ComputerProgramming,ElectronicsAndSemicon,ComputerScience,MechanicalEngg,ElectricalEngg,TelecomEngg,CivilEngg,conscientiousness,agreeableness,extraversion,nueroticism,openess_to_experience,birth_year
count,1000.0,1000,1000,1000.0,1000,1000.0,1000.0,1000,1000.0,1000.0,1000,1000,1000.0,1000.0,1000.0,1000,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
unique,,2,786,,106,,,118,,2.0,4,31,,,,22,9.0,,,,,,,,,,,,,,,,,14.0
top,,m,1991-01-01,,cbse,,,cbse,,2.0,B.Tech/B.E.,electronics and communication engineering,,,,Uttar Pradesh,2013.0,,,,,,,,,,,,,,,,,1991.0
freq,,759,6,,369,,,361,,929.0,943,210,,,,217,297.0,,,,,,,,,,,,,,,,,252.0
mean,660502.6,,,78.70246,,2008.108,74.84203,,4996.88,,,,71.41516,4996.88,0.313,,,503.396,505.098,511.101,0.516582,356.803,92.631,80.535,19.485,17.111,34.198,4.893,-0.035185,0.207292,0.037015,-0.238189,-0.129113,
std,358304.8,,,9.339532,,1.717931,10.627497,,4877.271717,,,,8.304125,4877.271717,0.463946,,,103.872329,85.163168,122.656722,0.484394,207.881122,156.791161,167.189291,92.924145,92.064459,108.676038,47.451778,1.040167,0.896064,0.916616,0.988733,1.01134,
min,16037.0,,,48.0,,1995.0,45.0,,13.0,,,,6.45,13.0,0.0,,,225.0,245.0,135.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-4.1267,-4.7826,-4.2935,-2.643,-5.8428,
25%,336799.5,,,72.5325,,2007.0,67.0,,439.0,,,,66.0,439.0,0.0,,,425.0,450.0,425.0,0.356536,302.5,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-0.7264,-0.2871,-0.4891,-0.995,-0.643,
50%,638999.5,,,80.0,,2008.0,75.0,,3579.0,,,,71.515,3579.0,0.0,,,500.0,505.0,515.0,0.64939,425.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.0464,0.2668,0.0914,-0.2609,-0.0943,
75%,983340.8,,,86.0,,2009.0,82.425,,8350.25,,,,76.4125,8350.25,1.0,,,575.0,565.0,595.0,0.843124,505.0,233.0,-1.0,-1.0,-1.0,-1.0,-1.0,0.7027,0.8784,0.6248,0.4148,0.552675,


In [9]:
eng_grad_train_data['10board']=eng_grad_train_data['10board'].astype('category').cat.codes
holdout_data['10board']=holdout_data['10board'].astype('category').cat.codes
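
One caveat with the cell above: `.cat.codes` numbers the categories found in each frame separately, so the same board name can receive a different code in the training and holdout data. A minimal sketch of one way to keep the codes consistent is to share a single category list; the toy frames and board names below are illustrative.

```python
import pandas as pd

train = pd.DataFrame({"10board": ["cbse", "state board", "icse"]})
holdout = pd.DataFrame({"10board": ["icse", "cbse", "unseen board"]})

# Build the category list from the training data only
board_dtype = pd.CategoricalDtype(categories=sorted(train["10board"].unique()))

# Values not seen in training become NaN, which .cat.codes encodes as -1
train["10board_code"] = train["10board"].astype(board_dtype).cat.codes
holdout["10board_code"] = holdout["10board"].astype(board_dtype).cat.codes

print(train)
print(holdout)
```

With a shared dtype, `cbse` maps to the same integer in both frames, and boards that only occur in the holdout data are flagged with -1 rather than silently colliding with another board.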

In [10]:
eng_grad_train_data.drop(columns=['DOB','12board'],inplace=True)
holdout_data.drop(columns=['DOB','12board'],inplace=True)

In [11]:
eng_grad_train_data.info()
holdout_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2998 entries, 0 to 2997
Data columns (total 33 columns):
 #   Column                 Non-Null Count  Dtype   
---  ------                 --------------  -----   
 0   ID                     2998 non-null   int64   
 1   Gender                 2998 non-null   object  
 2   10percentage           2998 non-null   float64 
 3   10board                2998 non-null   int16   
 4   12graduation           2998 non-null   int64   
 5   12percentage           2998 non-null   float64 
 6   CollegeID              2998 non-null   int64   
 7   CollegeTier            2998 non-null   category
 8   Degree                 2998 non-null   object  
 9   Specialization         2998 non-null   object  
 10  collegeGPA             2998 non-null   float64 
 11  CollegeCityID          2998 non-null   int64   
 12  CollegeCityTier        2998 non-null   int64   
 13  CollegeState           2998 non-null   object  
 14  GraduationYear         2998 non-null   c

In [12]:
# Expand the abbreviated specialization name to its full form
old_name = 'electronics & instrumentation eng'
new_name = 'electronics and instrumentation engineering'
eng_grad_train_data['Specialization'] = eng_grad_train_data['Specialization'].str.replace(old_name, new_name, regex=False)
holdout_data['Specialization'] = holdout_data['Specialization'].str.replace(old_name, new_name, regex=False)

In [13]:
def create_dummies(df,column_name):
    """Replace column_name with one-hot (dummy) columns prefixed by its name."""
    dummies=pd.get_dummies(df[column_name],prefix=column_name)
    df=pd.concat([df,dummies],axis=1)
    del df[column_name]
    return df

# Normalize text columns (lower case, no surrounding spaces), then one-hot encode them
for col in eng_grad_train_data.select_dtypes(include=['object']).columns:
    eng_grad_train_data[col]=eng_grad_train_data[col].str.lower().str.strip()
    holdout_data[col]=holdout_data[col].str.lower().str.strip()
    eng_grad_train_data=create_dummies(eng_grad_train_data,col)
    holdout_data=create_dummies(holdout_data,col)

# One-hot encode the columns cast to 'category' earlier (CollegeTier, GraduationYear, birth_year)
for col in eng_grad_train_data.select_dtypes(include=['category']).columns:
    eng_grad_train_data=create_dummies(eng_grad_train_data,col)
    holdout_data=create_dummies(holdout_data,col)

In [14]:
X=eng_grad_train_data.drop(columns=['Salary'])
Y=eng_grad_train_data.Salary
print(X.shape)

(2998, 127)


In [15]:
X.shape

(2998, 127)

In [16]:
holdout_data.columns

Index(['ID', '10percentage', '10board', '12graduation', '12percentage',
       'CollegeID', 'collegeGPA', 'CollegeCityID', 'CollegeCityTier',
       'English',
       ...
       'birth_year_1985', 'birth_year_1986', 'birth_year_1987',
       'birth_year_1988', 'birth_year_1989', 'birth_year_1990',
       'birth_year_1991', 'birth_year_1992', 'birth_year_1993',
       'birth_year_1994'],
      dtype='object', length=108)

In [17]:
X.columns

Index(['ID', '10percentage', '10board', '12graduation', '12percentage',
       'CollegeID', 'collegeGPA', 'CollegeCityID', 'CollegeCityTier',
       'English',
       ...
       'birth_year_1987', 'birth_year_1988', 'birth_year_1989',
       'birth_year_1990', 'birth_year_1991', 'birth_year_1992',
       'birth_year_1993', 'birth_year_1994', 'birth_year_1995',
       'birth_year_1997'],
      dtype='object', length=127)

In [18]:
# Manually drop dummy columns so that X and holdout_data end up with the same number of columns
X=X.drop(columns=[
    'CollegeState_assam',
    'CollegeState_goa',
    'CollegeState_union territory',
    'CollegeState_meghalaya',
    'Specialization_aeronautical engineering',
    'Specialization_biomedical engineering',
    'Specialization_ceramic engineering',
    'Specialization_computer and communication engineering',
    'Specialization_computer networking',
    'Specialization_control and instrumentation engineering',
    'Specialization_embedded systems technology',
    'Specialization_industrial & management engineering',
    'Specialization_industrial engineering',
    'Specialization_information science',
    'Specialization_mechanical & production engineering',
    'Specialization_mechanical and automation',
    'Specialization_mechanical engineering',
    'GraduationYear_0',
    'GraduationYear_2007',
    'birth_year_1981',
    'birth_year_1995',
    'birth_year_1997',
])
holdout_data=holdout_data.drop(columns=[
    'Specialization_polymer technology',
    'Specialization_power systems and automation',
    'birth_year_1977',
])
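
Dropping columns by hand matches the column counts (105 each), but the column listings printed by `X.head()` and `holdout_data.head()` further down still differ in a few names: for example, `Specialization_electronics` appears only in `X` and `Specialization_computer science` only in `holdout_data`. A hedged sketch of one way to guarantee identical columns is to reindex the holdout frame against the training columns; the toy frames below are illustrative.

```python
import pandas as pd

train = pd.DataFrame({"spec": ["civil", "mechanical", "electronics"]})
holdout = pd.DataFrame({"spec": ["civil", "computer science"]})

train_d = pd.get_dummies(train, columns=["spec"])
holdout_d = pd.get_dummies(holdout, columns=["spec"])

# Give the holdout frame exactly the training columns, in the same order:
# columns missing from holdout are filled with 0, extra ones are dropped.
holdout_aligned = holdout_d.reindex(columns=train_d.columns, fill_value=0)

print(list(train_d.columns))
print(list(holdout_aligned.columns))
```

This keeps the position of every feature the same in both frames, which matters because the scaler and the model downstream rely on column order.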

In [19]:
holdout_data.shape

(1000, 105)

In [20]:
X.head()

Unnamed: 0,ID,10percentage,10board,12graduation,12percentage,CollegeID,collegeGPA,CollegeCityID,CollegeCityTier,English,Logical,Quant,Domain,ComputerProgramming,ElectronicsAndSemicon,ComputerScience,MechanicalEngg,ElectricalEngg,TelecomEngg,CivilEngg,conscientiousness,agreeableness,extraversion,nueroticism,openess_to_experience,Gender_f,Gender_m,Degree_b.tech/b.e.,Degree_m.sc. (tech.),Degree_m.tech./m.e.,Degree_mca,Specialization_applied electronics and instrumentation,Specialization_automobile/automotive engineering,Specialization_biotechnology,Specialization_chemical engineering,Specialization_civil engineering,Specialization_computer application,Specialization_computer engineering,Specialization_computer science & engineering,Specialization_computer science and technology,Specialization_electrical and power engineering,Specialization_electrical engineering,Specialization_electronics,Specialization_electronics & telecommunications,Specialization_electronics and communication engineering,Specialization_electronics and computer engineering,Specialization_electronics and electrical engineering,Specialization_electronics and instrumentation engineering,Specialization_electronics engineering,Specialization_industrial & production engineering,Specialization_information & communication technology,Specialization_information science engineering,Specialization_information technology,Specialization_instrumentation and control engineering,Specialization_instrumentation engineering,Specialization_mechatronics,Specialization_metallurgical engineering,Specialization_other,Specialization_telecommunication engineering,CollegeState_andhra pradesh,CollegeState_bihar,CollegeState_chhattisgarh,CollegeState_delhi,CollegeState_gujarat,CollegeState_haryana,CollegeState_himachal pradesh,CollegeState_jammu and kashmir,CollegeState_jharkhand,CollegeState_karnataka,CollegeState_kerala,CollegeState_madhya 
pradesh,CollegeState_maharashtra,CollegeState_orissa,CollegeState_punjab,CollegeState_rajasthan,CollegeState_sikkim,CollegeState_tamil nadu,CollegeState_telangana,CollegeState_uttar pradesh,CollegeState_uttarakhand,CollegeState_west bengal,CollegeTier_1,CollegeTier_2,GraduationYear_2009,GraduationYear_2010,GraduationYear_2011,GraduationYear_2012,GraduationYear_2013,GraduationYear_2014,GraduationYear_2015,GraduationYear_2016,GraduationYear_2017,birth_year_1982,birth_year_1983,birth_year_1984,birth_year_1985,birth_year_1986,birth_year_1987,birth_year_1988,birth_year_1989,birth_year_1990,birth_year_1991,birth_year_1992,birth_year_1993,birth_year_1994
0,604399,87.8,46,2009,84.0,6920,73.82,6920,1,650,665,810,0.694479,485,366,-1,-1,-1,-1,-1,-0.159,0.3789,1.2396,0.1459,0.2889,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
1,988334,57.0,46,2010,64.5,6624,65.0,6624,0,440,435,210,0.342315,365,-1,-1,-1,-1,-1,-1,1.1336,0.0459,1.2396,0.5262,-0.2859,0,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
2,301647,77.33,129,2007,85.17,9084,61.94,9084,0,485,475,505,0.824666,-1,400,-1,-1,-1,260,-1,0.51,-0.1232,1.5428,-0.2902,-0.2875,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
3,582313,84.3,46,2009,86.0,8195,80.4,8195,1,675,620,635,0.990009,655,-1,-1,-1,-1,-1,-1,-0.4463,0.2124,0.3174,0.2727,0.4805,0,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
4,339001,82.0,46,2008,75.0,4889,64.3,4889,1,575,495,365,0.278457,315,-1,-1,-1,-1,-1,-1,-1.4992,-0.7473,-1.0697,0.06223,0.1864,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0


In [21]:
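from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# Standardize every column to zero mean and unit variance
X=pd.DataFrame(scaler.fit_transform(X),columns=X.columns)
# Note: fit_transform refits the scaler, so the holdout set is scaled with its own statistics
holdout_data=pd.DataFrame(scaler.fit_transform(holdout_data),columns=holdout_data.columns)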
# from sklearn.preprocessing import minmax_scale
# for col in eng_grad_train_data.columns:
#     eng_grad_train_data[col+"scaled"]=minmax_scale(eng_grad_train_data[col])
#     #holdout_data[col+"scaled"]=minmax_scale(holdout_data[col])
# for col in holdout_data.columns:
#     holdout_data[col+"scaled"]=minmax_scale(holdout_data[col])
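
The scaling cell above calls `fit_transform` on both frames, so the holdout data is standardized with its own means and standard deviations rather than those of the training data. A common alternative, sketched here on made-up numbers, is to fit the scaler on the training features only and reuse it for the holdout set.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

train = pd.DataFrame({"collegeGPA": [60.0, 70.0, 80.0], "English": [400.0, 500.0, 600.0]})
holdout = pd.DataFrame({"collegeGPA": [70.0, 90.0], "English": [500.0, 700.0]})

scaler = StandardScaler()
train_scaled = pd.DataFrame(scaler.fit_transform(train), columns=train.columns)

# transform (not fit_transform): reuse the training mean and standard deviation
holdout_scaled = pd.DataFrame(scaler.transform(holdout), columns=holdout.columns)

print(holdout_scaled)
```

This approach assumes both frames have the same columns in the same order, so it only works once the dummy columns have been aligned.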
    

In [22]:
holdout_data.head()

Unnamed: 0,ID,10percentage,10board,12graduation,12percentage,CollegeID,collegeGPA,CollegeCityID,CollegeCityTier,English,Logical,Quant,Domain,ComputerProgramming,ElectronicsAndSemicon,ComputerScience,MechanicalEngg,ElectricalEngg,TelecomEngg,CivilEngg,conscientiousness,agreeableness,extraversion,nueroticism,openess_to_experience,Gender_f,Gender_m,Degree_b.tech/b.e.,Degree_m.sc. (tech.),Degree_m.tech./m.e.,Degree_mca,Specialization_applied electronics and instrumentation,Specialization_automobile/automotive engineering,Specialization_biotechnology,Specialization_chemical engineering,Specialization_civil engineering,Specialization_computer application,Specialization_computer engineering,Specialization_computer science,Specialization_computer science & engineering,Specialization_computer science and technology,Specialization_electrical and power engineering,Specialization_electrical engineering,Specialization_electronics & telecommunications,Specialization_electronics and communication engineering,Specialization_electronics and electrical engineering,Specialization_electronics and instrumentation engineering,Specialization_electronics engineering,Specialization_industrial & production engineering,Specialization_information & communication technology,Specialization_information science engineering,Specialization_information technology,Specialization_instrumentation and control engineering,Specialization_instrumentation engineering,Specialization_internal combustion engine,Specialization_mechanical engineering,Specialization_mechatronics,Specialization_other,Specialization_telecommunication engineering,CollegeState_andhra pradesh,CollegeState_bihar,CollegeState_chhattisgarh,CollegeState_delhi,CollegeState_gujarat,CollegeState_haryana,CollegeState_himachal pradesh,CollegeState_jammu and kashmir,CollegeState_jharkhand,CollegeState_karnataka,CollegeState_kerala,CollegeState_madhya 
pradesh,CollegeState_maharashtra,CollegeState_orissa,CollegeState_punjab,CollegeState_rajasthan,CollegeState_sikkim,CollegeState_tamil nadu,CollegeState_telangana,CollegeState_uttar pradesh,CollegeState_uttarakhand,CollegeState_west bengal,CollegeTier_1,CollegeTier_2,GraduationYear_2009,GraduationYear_2010,GraduationYear_2011,GraduationYear_2012,GraduationYear_2013,GraduationYear_2014,GraduationYear_2015,GraduationYear_2016,GraduationYear_2017,birth_year_1982,birth_year_1983,birth_year_1984,birth_year_1985,birth_year_1986,birth_year_1987,birth_year_1988,birth_year_1989,birth_year_1990,birth_year_1991,birth_year_1992,birth_year_1993,birth_year_1994
0,-1.174447,1.510203,-1.443601,-0.645285,1.30086,-0.642254,1.531914,-0.642254,1.481516,-1.429353,-0.001151,0.847495,0.636339,0.328222,1.961353,-0.487925,-0.220559,-0.196819,4.039727,-0.124251,-1.110873,-0.717486,-2.316866,2.32544,-0.156689,-0.563492,0.563492,0.245856,-0.031639,-0.114766,-0.211972,-0.063372,-0.031639,-0.054855,-0.044766,-0.119159,-0.211972,-0.476439,-0.044766,-0.479596,-0.044766,-0.031639,-0.139169,5.5,-0.51558,-0.224544,-0.131507,-0.077693,-0.044766,-0.031639,-0.095298,-0.426653,-0.044766,-0.031639,-0.031639,-0.219586,-0.031639,-0.054855,-0.044766,-0.236572,-0.044766,-0.100504,-0.204124,-0.077693,-0.214535,-0.054855,-0.044766,-0.089803,3.199368,-0.095298,-0.217072,-0.280622,-0.209383,-0.234206,-0.224544,-0.031639,-0.338862,-0.274352,-0.52644,-0.181818,-0.231821,-0.276453,0.276453,-0.089803,-0.290859,2.645751,-0.509358,-0.649981,-0.580429,-0.175863,-0.054855,-0.031639,-0.031639,-0.031639,-0.044766,-0.110208,-0.123404,-0.172818,-0.259299,2.553193,-0.496873,-0.580429,-0.510915,-0.28269,-0.110208
1,0.804629,-0.396627,-0.263249,1.101876,-0.667662,1.684595,0.178897,1.684595,-0.674985,-0.851432,-2.644461,-1.191733,0.815475,-0.345577,-0.597469,2.510178,-0.220559,-0.196819,-0.324042,-0.124251,-0.809905,-0.737919,0.641576,0.132143,0.792626,-0.563492,0.563492,0.245856,-0.031639,-0.114766,-0.211972,-0.063372,-0.031639,-0.054855,-0.044766,-0.119159,-0.211972,-0.476439,-0.044766,2.085088,-0.044766,-0.031639,-0.139169,-0.181818,-0.51558,-0.224544,-0.131507,-0.077693,-0.044766,-0.031639,-0.095298,-0.426653,-0.044766,-0.031639,-0.031639,-0.219586,-0.031639,-0.054855,-0.044766,-0.236572,-0.044766,-0.100504,-0.204124,-0.077693,-0.214535,-0.054855,-0.044766,-0.089803,-0.312562,-0.095298,-0.217072,-0.280622,-0.209383,-0.234206,-0.224544,-0.031639,-0.338862,-0.274352,-0.52644,-0.181818,4.313681,-0.276453,0.276453,-0.089803,-0.290859,-0.377964,-0.509358,-0.649981,1.722862,-0.175863,-0.054855,-0.031639,-0.031639,-0.031639,-0.044766,-0.110208,-0.123404,-0.172818,-0.259299,-0.391666,-0.496873,1.722862,-0.510915,-0.28269,-0.110208
2,-0.755108,0.053299,-0.014754,-0.062898,-1.49141,0.323728,0.190945,0.323728,1.481516,-1.044072,-0.001151,-1.444598,0.019293,0.280093,-0.597469,-0.487925,-0.220559,-0.196819,-0.324042,-0.124251,1.416429,0.327717,-0.455505,0.898558,0.780952,-0.563492,0.563492,-4.067414,-0.031639,8.713385,-0.211972,-0.063372,-0.031639,-0.054855,-0.044766,-0.119159,-0.211972,2.098906,-0.044766,-0.479596,-0.044766,-0.031639,-0.139169,-0.181818,-0.51558,-0.224544,-0.131507,-0.077693,-0.044766,-0.031639,-0.095298,-0.426653,-0.044766,-0.031639,-0.031639,-0.219586,-0.031639,-0.054855,-0.044766,-0.236572,-0.044766,-0.100504,-0.204124,-0.077693,-0.214535,-0.054855,-0.044766,-0.089803,3.199368,-0.095298,-0.217072,-0.280622,-0.209383,-0.234206,-0.224544,-0.031639,-0.338862,-0.274352,-0.52644,-0.181818,-0.231821,-0.276453,0.276453,-0.089803,-0.290859,-0.377964,-0.509358,-0.649981,1.722862,-0.175863,-0.054855,-0.031639,-0.031639,-0.031639,-0.044766,-0.110208,-0.123404,-0.172818,-0.259299,-0.391666,-0.496873,1.722862,-0.510915,-0.28269,-0.110208
3,-0.222795,0.4818,-0.729177,0.519489,0.523242,0.656047,0.287331,0.656047,1.481516,0.978652,1.232393,0.398865,0.674462,0.761378,-0.597469,-0.487925,-0.220559,-0.196819,-0.324042,-0.124251,-1.224276,0.935232,-0.029487,0.645279,-0.344751,1.774649,-1.774649,0.245856,-0.031639,-0.114766,-0.211972,-0.063372,-0.031639,-0.054855,-0.044766,-0.119159,-0.211972,-0.476439,-0.044766,2.085088,-0.044766,-0.031639,-0.139169,-0.181818,-0.51558,-0.224544,-0.131507,-0.077693,-0.044766,-0.031639,-0.095298,-0.426653,-0.044766,-0.031639,-0.031639,-0.219586,-0.031639,-0.054855,-0.044766,-0.236572,-0.044766,-0.100504,4.898979,-0.077693,-0.214535,-0.054855,-0.044766,-0.089803,-0.312562,-0.095298,-0.217072,-0.280622,-0.209383,-0.234206,-0.224544,-0.031639,-0.338862,-0.274352,-0.52644,-0.181818,-0.231821,3.617251,-3.617251,-0.089803,-0.290859,-0.377964,-0.509358,1.538507,-0.580429,-0.175863,-0.054855,-0.031639,-0.031639,-0.031639,-0.044766,-0.110208,-0.123404,-0.172818,-0.259299,-0.391666,-0.496873,1.722862,-0.510915,-0.28269,-0.110208
4,-0.906404,0.717476,-1.443601,-0.062898,1.304626,-0.790977,-1.13436,-0.790977,-0.674985,-0.466151,-0.940995,0.276511,0.045038,-1.722052,1.533819,-0.487925,-0.220559,-0.196819,2.318156,-0.124251,0.747451,1.895354,-0.099016,0.244566,-0.156689,-0.563492,0.563492,0.245856,-0.031639,-0.114766,-0.211972,-0.063372,-0.031639,-0.054855,-0.044766,-0.119159,-0.211972,-0.476439,-0.044766,-0.479596,-0.044766,-0.031639,-0.139169,-0.181818,1.939563,-0.224544,-0.131507,-0.077693,-0.044766,-0.031639,-0.095298,-0.426653,-0.044766,-0.031639,-0.031639,-0.219586,-0.031639,-0.054855,-0.044766,4.227047,-0.044766,-0.100504,-0.204124,-0.077693,-0.214535,-0.054855,-0.044766,-0.089803,-0.312562,-0.095298,-0.217072,-0.280622,-0.209383,-0.234206,-0.224544,-0.031639,-0.338862,-0.274352,-0.52644,-0.181818,-0.231821,-0.276453,0.276453,-0.089803,-0.290859,-0.377964,1.963255,-0.649981,-0.580429,-0.175863,-0.054855,-0.031639,-0.031639,-0.031639,-0.044766,-0.110208,-0.123404,-0.172818,-0.259299,-0.391666,-0.496873,1.722862,-0.510915,-0.28269,-0.110208


In [23]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, 
                                                    test_size=0.11, 
                                                    random_state=11)

In [24]:
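from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
# Note: the normalize argument was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn release to run as written
rg=Ridge(max_iter=2000,random_state=21,alpha=0.75,normalize=True,solver='sparse_cg')
rg.fit(X_train,y_train)
prediction=rg.predict(X_test)
# RMSE on the validation split carved out of the training data
rmse=np.sqrt(mean_squared_error(y_test,prediction))
rmse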



137358.50880536312
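
A single 89/11 split gives a fairly noisy RMSE estimate. As a hedged sketch, k-fold cross-validation averages the error over several splits. Synthetic data stands in for `X` and `Y` here, the Ridge settings are simplified, and the `normalize` argument is omitted because it no longer exists in scikit-learn 1.2 and later.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the scaled features X and the Salary target Y
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

model = Ridge(alpha=0.75)
cv = KFold(n_splits=5, shuffle=True, random_state=11)

# scikit-learn maximizes scores, so RMSE is exposed as a negative value
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="neg_root_mean_squared_error")
cv_rmse = -scores

print(cv_rmse.mean(), cv_rmse.std())
```

The spread across folds gives a rough sense of how much a small RMSE difference between two models can be trusted.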

In [25]:
from sklearn.feature_selection import RFE  # recursive feature elimination

rfe = RFE(estimator= rg , step = 1) 
# estimator=rg: the Ridge model from the previous cell serves as the base estimator
# step=1: removes the weakest feature one at a time, refitting the model on the remaining features
# Features are ranked by the magnitude of the fitted model's coefficients, not by model accuracy
# n_features_to_select is not set, so RFE keeps half of the features by default

# Fit RFE to rank the features
fit = rfe.fit(X_train, y_train)

print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)

Num Features: 52
Selected Features: [ True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True False  True  True  True False  True
  True  True  True False False False False False False  True False False
 False  True False False False False False False  True False False False
 False  True False False False  True False False False False False False
  True False  True False False False False  True  True  True False False
 False  True  True  True  True False False False  True  True  True False
  True  True False  True  True  True False False False  True  True False
  True  True False False False  True False False False]
Feature Ranking: [ 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 11  1  1  1  8  1
  1  1  1 36 19 40 18 22 15  1 45 33  5  1 20 43 30  7  2 32  1 26 25 38
 54  1 52 17 50  1 34  4 21 13 49 48  1 35  1 47  9 39 10  1  1  1  3 12
 51  1  1  1  1 42 16 24  1  1  1 14  1  1 23  1  1  1 27  6 53  1  1 31
  1  1 44 29 46
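
`RFE` kept 52 features here because `n_features_to_select` was left at its default, which keeps half of the input features. A minimal sketch on synthetic data (all names below are illustrative) shows how to set the count explicitly and read back the selected column names.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import Ridge

X_arr, y_demo = make_regression(n_samples=100, n_features=8, n_informative=3, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

# Keep exactly 3 features, eliminating one per iteration
selector = RFE(estimator=Ridge(alpha=1.0), n_features_to_select=3, step=1)
selector.fit(X_demo, y_demo)

# support_ is a boolean mask over the input columns
selected = X_demo.columns[selector.support_].tolist()
print(selected)
```

Treating the number of kept features as a tunable value, rather than accepting the default half, makes it possible to compare validation RMSE across several feature counts.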

In [26]:
pd.set_option('display.max_rows', 500)
selected_rfe_features = pd.DataFrame({'Feature':list(X_train.columns),
                                      'Ranking':rfe.ranking_})
selected_rfe_features.sort_values(by='Ranking')

Unnamed: 0,Feature,Ranking
0,ID,1
101,birth_year_1991,1
82,CollegeTier_2,1
37,Specialization_computer engineering,1
81,CollegeTier_1,1
80,CollegeState_west bengal,1
85,GraduationYear_2011,1
44,Specialization_electronics and communication e...,1
87,GraduationYear_2013,1
97,birth_year_1987,1


In [27]:
# Transforming the data
X_train_rfe = rfe.transform(X_train)
X_test_rfe = rfe.transform(X_test)
holdout_rfe = rfe.transform(holdout_data)

# Refit the Ridge model on the selected features
# (fit returns the estimator itself, so lr_rfe_model and rg are the same object)
lr_rfe_model = rg.fit(X_train_rfe, y_train)

In [28]:
# making predictions and evaluating the model
y_pred_rfe = lr_rfe_model.predict(X_test_rfe)
rmse=np.sqrt(mean_squared_error(y_test,y_pred_rfe))
rmse


137346.99400093584

In [29]:
prediction = lr_rfe_model.predict(holdout_rfe)
res = pd.DataFrame(prediction)  # final predictions of the model on the unseen holdout data
res.index = holdout_data.index  # keep the holdout row order so predictions can be matched to rows
res.columns = ["prediction"]
res.to_csv("submission3.csv",index=False)

Link for the challenge: https://dphi.tech/challenges/data-sprint-2-engineering-graduates-employment-outcomes/20/overview/about