# Campus Recruitment

## Multiple Linear Regression

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

import warnings
warnings.filterwarnings('ignore')

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


### Reading and Understanding the data

In [None]:
#reading the data

df = pd.read_csv('../input/factors-affecting-campus-placement/Placement_Data_Full_Class.csv')

df.head()

In [None]:
df.shape

In [None]:
df.info()

* All the columns except 'salary' have no null values. 
* Salary for the students who are not placed has been mentioned as null. We can impute those values as zero.

In [None]:
# Changing null values in the salary column to zero
# (assigning the result back avoids relying on chained inplace modification)

df['salary'] = df['salary'].fillna(0)

df.info()

* Now there is no null value in the dataframe.

In [None]:
#Dropping the variable sl_no since it has no impact on the dependent variable

df.drop('sl_no',axis=1,inplace=True)

In [None]:
#Checking the distribution of the data

df.describe()

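* Judging from the summary statistics, all the columns except salary & etest_p look roughly symmetric. `describe()` alone cannot establish normality; the distribution plots in the Data Visualization section give a better picture.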
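
Skewness is one quick numeric check of symmetry that summary statistics do not show directly. The following is a minimal sketch on made-up samples (the values and the helper name `sample_skewness` are hypothetical, not taken from the placement data):

```python
import numpy as np

def sample_skewness(values):
    """Return the skewness as the mean of cubed z-scores (no bias correction)."""
    x = np.asarray(values, dtype=float)
    z = (x - x.mean()) / x.std()   # std() uses ddof=0 here
    return float(np.mean(z ** 3))

# Hypothetical samples, for illustration only
symmetric = [50, 55, 60, 65, 70]
right_skewed = [0, 0, 0, 200000, 250000, 300000, 940000]

print(sample_skewness(symmetric))     # close to 0 for a symmetric sample
print(sample_skewness(right_skewed))  # clearly positive for a long right tail
```

pandas offers a bias-corrected variant as `Series.skew()`, so its numbers differ slightly from this uncorrected version.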

In [None]:
#Checking for duplicate rows

df.loc[df.duplicated()]

### Data Visualization

In [None]:
#Plotting the distribution of each numeric column

df_num = df.select_dtypes(include=[np.number])

col_num = list(df_num.columns)

plt.figure(figsize=(20,30))

# One subplot per numeric column on an 8x4 grid (subplot positions start at 1).
# histplot with kde=True replaces the deprecated distplot.
for m, col in enumerate(col_num, start=1):
    plt.subplot(8,4,m)
    sns.histplot(df_num[col], kde=True, stat='density')

plt.show()

* The `df.duplicated()` check earlier returned no rows, so there are no duplicate rows.

In [None]:
#Plotting the correlation heatmap of the numeric columns

sns.heatmap(df.corr(numeric_only=True),linewidths=0.5,cmap='YlGnBu',annot=True)
plt.show()

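* It is quite interesting that, in the heatmap, salary shows little correlation with the MBA percentage or the employability test percentage, while there is a fairer chance of a good salary if the student scores well in secondary education. Keep in mind that salary here includes the zeros imputed for students who were not placed, and that correlation alone does not establish dependence.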

In [None]:
df.info()

### Data Preparation

* The value 'Others' is used in three categorical columns: ssc_b, hsc_b & degree_t.
* This will create a problem when converting the values to dummy variables, because all three columns would produce a dummy column with the same name, 'Others'.
* We must make that value unique in each column.
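
As an alternative to renaming the values by hand, pandas can prefix each dummy column with the name of its source column. This is a minimal sketch on a toy frame (the rows are illustrative; only the column names come from the notebook):

```python
import pandas as pd

# Toy frame reusing two of the notebook's column names; values are illustrative
toy = pd.DataFrame({
    "ssc_b": ["Central", "Others", "Others"],
    "hsc_b": ["Others", "Central", "Others"],
})

# Passing `columns=` makes pandas prefix every dummy with its source column name,
# so the two 'Others' categories no longer collide
encoded = pd.get_dummies(toy, columns=["ssc_b", "hsc_b"], drop_first=True, dtype=int)

print(list(encoded.columns))
```

The notebook keeps its own `dummy` helper below; this only shows that the name clash can also be avoided at encoding time.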

In [None]:
# Assign the result back instead of relying on chained inplace replacement
df['ssc_b'] = df['ssc_b'].replace('Others','sscb_other')
df['hsc_b'] = df['hsc_b'].replace('Others','hscb_other')
df['degree_t'] = df['degree_t'].replace('Others','deg_other')

In [None]:
df.head()

In [None]:
# Function for creating dummy variables for categorical variables

def dummy(x,df):
    temp = pd.get_dummies(df[x],drop_first = True)
    df =pd.concat([df,temp],axis=1)
    df.drop(x,axis=1,inplace=True)
    return df

#Getting dummy variables for the categorical variables in df
df = dummy('status',df)
df = dummy('specialisation',df)
df = dummy('workex',df)
df = dummy('degree_t',df)
df = dummy('hsc_s',df)
df = dummy('hsc_b',df)
df = dummy('ssc_b',df)
df = dummy('gender',df)

In [None]:
df.head()

* Let's rename the column 'Yes' to 'Workex' and the column 'M' to 'Male' for better understanding.


In [None]:
df.rename(columns={'Yes':'Workex','M':'Male'},inplace=True)
df.head()

### Model Building

#### Dividing the data into train and test sets

In [None]:
from sklearn.model_selection import train_test_split

# random_state fixes the shuffle, so the train and test sets always contain the same rows

df_train,df_test = train_test_split(df,test_size=0.2,random_state=100)

#### Feature Scaling

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df_train[col_num] = scaler.fit_transform(df_train[col_num])
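
Min-max scaling maps each column to the range seen in the training data: x' = (x - min) / (max - min). The sketch below reproduces that arithmetic with NumPy on made-up numbers (the arrays are hypothetical), to show why the scaler is fitted on the training set only and then reused on the test set:

```python
import numpy as np

# Hypothetical percentages, for illustration only
train = np.array([40.0, 60.0, 80.0])
test = np.array([50.0, 90.0])

# The minimum and maximum are learned from the training data only
train_min, train_max = train.min(), train.max()

train_scaled = (train - train_min) / (train_max - train_min)
test_scaled = (test - train_min) / (train_max - train_min)  # reuses the training statistics

print(train_scaled)  # training values lie in [0, 1]
print(test_scaled)   # test values can fall outside [0, 1] if they exceed the training range
```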



### Q1.  Develop an estimated multiple linear regression equation with mbap as response variable and sscp & hscp as the two predictor variables. Interpret the regression coefficients and check whether they are significant based on the summary

#### Model 1

#### Dividing the train and test set into X&y variables.

In [None]:
y_train = df_train['mba_p']
X_train = df_train[['ssc_p','hsc_p']]

In [None]:
import statsmodels.api as sm

#Adding a constant to X_train: statsmodels' OLS does not include an intercept by default,
#so without it the fitted line would be forced through the origin.
X_train = sm.add_constant(X_train)

#Fitting linear model

lm = sm.OLS(y_train,X_train).fit()



In [None]:
#Printing the parameters

print(lm.params)

#Printing the summary

print(lm.summary())

* The R-squared value is 0.193, which means only 19.3% of the variance in mba_p is explained by ssc_p and hsc_p.
* The coefficients of both independent variables have very low p-values, which means they are statistically significant.
* The low R-squared shows that the model has limited explanatory power. The overall significance of the fit should be judged from Prob (F-statistic) in the summary rather than from the size of the F-statistic itself. We can try to do better with other variables.
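
For an OLS model with an intercept, the overall F-statistic follows directly from R-squared: F = (R² / k) / ((1 - R²) / (n - k - 1)), with k predictors and n observations. A minimal sketch of that arithmetic (the helper name `f_statistic` is hypothetical, and the number of training rows is an assumption, as flagged in the comments):

```python
def f_statistic(r_squared, n_obs, n_predictors):
    """Overall F-statistic of an OLS model with an intercept, from its R-squared."""
    explained = r_squared / n_predictors
    unexplained = (1.0 - r_squared) / (n_obs - n_predictors - 1)
    return explained / unexplained

# R-squared of 0.193 is the value quoted above. n_obs = 172 is an ASSUMPTION
# (an 80% split of 215 rows); check df_train.shape for the real number.
f_value = f_statistic(r_squared=0.193, n_obs=172, n_predictors=2)
print(round(f_value, 2))
```

The p-value belonging to this statistic is what the summary prints as `Prob (F-statistic)`; on a fitted statsmodels result it is also available as `lm.f_pvalue`.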

### Q2. Estimate a multiple regression equation for each of the below scenarios and based on the model’s R-square comment which model is better. 

#### (i) Use mbap as outcome variable and sscp & degreep as the two predictor variables. (Model 2)
#### (ii) Use mbap as outcome variable and hscp & degreep as the two predictor variables. (Model 3)

#### Model 2

In [None]:
y_train2 = df_train['mba_p']
X_train2 = df_train[['ssc_p','degree_p']]

In [None]:
#Adding constant
X_train2 = sm.add_constant(X_train2)

#fitting linear model

lm2 = sm.OLS(y_train2,X_train2).fit()

In [None]:
#Printing model parameters

print(lm2.params)

#printing model summary
print(lm2.summary())

* The R-squared value and the F-statistic have improved slightly.
* The independent variables have very low p-values, which means ssc_p and degree_p are statistically significant predictors.

#### Model 3

In [None]:
y_train3 = df_train['mba_p']
X_train3 = df_train[['hsc_p','degree_p']]

In [None]:
#Adding constant to X_train3

X_train3 = sm.add_constant(X_train3)

#fitting the linear model

lm3 = sm.OLS(y_train3,X_train3).fit()

In [None]:
print(lm3.params)
print(lm3.summary())

* The results are quite similar to Model 2. Let's take all three independent variables and see if we get any improvement.

### Q3. Show the functional form of a multiple regression model. Build a regression model with mbap as dependent variable and sscp, hscp and degree_p as three independent variables.
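
The functional form asked for here, written out for the three predictors used in Model 4 (the coefficient symbols $\beta_j$ and the error term $\varepsilon$ are standard notation, not names from the notebook):

$$
\mathrm{mba\_p} = \beta_0 + \beta_1\,\mathrm{ssc\_p} + \beta_2\,\mathrm{hsc\_p} + \beta_3\,\mathrm{degree\_p} + \varepsilon
$$

Here $\beta_0$ is the intercept added by `sm.add_constant`, and each $\beta_j$ is the expected change in mba_p for a one-unit change in that predictor with the others held fixed.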

#### Model 4

In [None]:
y_train4 = df_train['mba_p']
X_train4 = df_train[['ssc_p','hsc_p','degree_p']]

In [None]:
#Adding constant to X_train4

X_train4= sm.add_constant(X_train4)

#Fitting the linear model

lm4 = sm.OLS(y_train4,X_train4).fit()

In [None]:
#Printing coefficients and statistical summary
print(lm4.params)

print(lm4.summary())

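* There isn't much improvement in the R-squared value with ssc_p, hsc_p and degree_p as independent variables.
* The F-statistic has dropped to 16.56. F-statistics of models with different numbers of predictors are not directly comparable, so adjusted R-squared is the safer basis for comparison; the drop does suggest that the third variable adds little.
* Among these four models, Model 3 is judged the best. Note that the residual analysis and the predictions below are carried out on Model 4 (`lm4`).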
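
When models differ in the number of predictors, adjusted R-squared is a fairer yardstick than raw R-squared, because it penalises every extra variable. A minimal sketch of the formula (all numbers are hypothetical and the helper name is an assumption, chosen only to illustrate the penalty):

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Penalise R-squared for the number of predictors (model includes an intercept)."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Hypothetical values: the three-variable model has a marginally higher raw R-squared
n_obs = 172
two_var = adjusted_r_squared(0.230, n_obs, n_predictors=2)
three_var = adjusted_r_squared(0.232, n_obs, n_predictors=3)

# The adjusted value is lower for the three-variable model in this example
print(round(two_var, 4), round(three_var, 4))
```

statsmodels reports the same quantity as "Adj. R-squared" in the summary and as `rsquared_adj` on the fitted result.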

### Residual Analysis

In [None]:
#Predicting on the train data

y_train_pred = lm4.predict(X_train4)

In [None]:
# Plotting the histogram of the error terms

fig = plt.figure()
sns.histplot((y_train4 - y_train_pred), bins = 20, kde = True, stat = 'density')  # replaces the deprecated distplot
fig.suptitle('Error Terms', fontsize = 20)                
plt.xlabel('Errors', fontsize = 18)   
plt.show()

* The residuals look approximately normally distributed and centred around zero, which is consistent with the assumptions of linear regression.
* Let's predict on the test data now.
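
A note on the zero-mean observation: for an OLS fit that includes an intercept, the training residuals average to zero by construction, so the shape of the histogram is the informative part. The sketch below shows this with NumPy on synthetic data whose noise is deliberately skewed (all data here is made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): the noise is exponential, i.e. clearly non-normal
x = rng.uniform(0, 1, size=200)
y = 0.3 + 0.5 * x + rng.exponential(scale=0.2, size=200)

# Ordinary least squares with an intercept column, similar to what sm.add_constant provides
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ coef
print(abs(residuals.mean()) < 1e-10)  # True: the mean is zero regardless of the noise shape
```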

### Prediction and evaluation

#### Applying scaling on test set.

In [None]:
df_test.head()

In [None]:
#Transforming the numerical variables of the test data with the scaler fitted on the training data
df_test[col_num] = scaler.transform(df_test[col_num])

In [None]:
#Extracting X_test and y_test from the df_test  

X_test = df_test[['ssc_p','hsc_p','degree_p']]
y_test = df_test['mba_p']


In [None]:
#Adding constant

X_test = sm.add_constant(X_test)

#Predicting with the model

y_pred = lm4.predict(X_test)

In [None]:
#Evaluating R2 score on the predictions

from sklearn.metrics import r2_score

print(r2_score(y_test,y_pred))

* The R2 score on the test data is slightly lower than the training R2 score.
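
The R2 score compares the model's squared error with that of always predicting the mean: R2 = 1 - SS_res / SS_tot. A minimal NumPy sketch of the same definition used by `sklearn.metrics.r2_score` (the helper name `r2` and the values are hypothetical):

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of predicting the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical values, for illustration only
y_true = [0.2, 0.4, 0.6, 0.8]

print(r2(y_true, y_true))                # perfect predictions
print(r2(y_true, [0.5, 0.5, 0.5, 0.5]))  # predicting the mean: score of about 0
print(r2(y_true, [0.8, 0.6, 0.4, 0.2]))  # reversed predictions: score below 0
```

Because the score can drop below zero on unseen data, a low or negative test R2 means the model does little better, or worse, than predicting the average.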

In [None]:
# Plotting y_test and y_pred to understand the spread.

fig = plt.figure()
sns.scatterplot(x=y_test, y=y_pred)                        # x and y must be passed as keywords in recent seaborn
fig.suptitle('y_test vs y_pred', fontsize=20)              # Plot heading 
plt.xlabel('y_test', fontsize=18)                          # X-label
plt.ylabel('y_pred', fontsize=16)   
plt.show()

In [None]:
#Putting y_test and y_pred to a dataframe.

compare_pred = pd.DataFrame({'y_test': y_test, 'y_pred': y_pred})

compare_pred.head(10)

### Conclusion:

* As we can see above, the R2 score on the test data is very low and the predicted values are far from the actual values.
* Hence secondary school percentage, higher secondary school percentage and degree percentage are, on their own, weak predictors of a student's MBA percentage.