### UpLift Project
### Perform any two data mining steps on the dataset linked below.
* Create two data visualizations.
* Fit a model to predict the salary.

https://drive.google.com/file/d/1MkN8BcyToPVGP-uoTuUe8AGu56yx5rPH/view?usp=sharing



In [0]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

# Code to read csv file into Colaboratory:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [0]:
# Get the file we are going to work on.

In [0]:

link = 'https://drive.google.com/open?id=1MkN8BcyToPVGP-uoTuUe8AGu56yx5rPH' # The shareable link

In [0]:
_, file_id = link.split('=')  # Use file_id so the built-in id() is not shadowed
print(file_id)  # Verify that we have everything after '='

In [0]:
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('Filename.csv')
df = pd.read_csv('Filename.csv')

### Let's check the data out.

In [0]:
df.head()

In [0]:
df.describe()

In [0]:
df.info()

### Check for null values in the "salary" column

In [0]:
df_nan = df[df['salary'].isna()]  # Rows where salary is missing
df_nan

### Drop unnecessary columns

In [0]:
df = df.drop(['sl_no'], axis=1)

### Check the correlation of variables with respect to salary

In [0]:
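# Correlation of each numeric column with "salary" (missing values are excluded pairwise)
df.select_dtypes(include='number').corr()['salary'].sort_values(ascending=False)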



### This shows that the variables most strongly correlated with salary are **"etest_p"** and **"mba_p"**



### Next, fill in the missing (NaN) values in the "salary" column. Let's impute them with the mean value of the column. This is not always the best approach, but it fits this specific case pretty well.

In [0]:
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(df[['salary']])
df['salary'] = imputer.transform(df[['salary']])
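
To see what mean imputation does, here is a minimal sketch on a toy column. The values are made up for illustration and are not taken from the dataset. It uses plain pandas `fillna` rather than `SimpleImputer`, but fills each gap with the same value: the mean of the non-missing entries. It also shows one reason this is "not always the best approach": the filled column has less spread than the observed values.

```python
import numpy as np
import pandas as pd

# Toy salary column with two missing entries (made-up values)
toy = pd.DataFrame({"salary": [200000.0, np.nan, 300000.0, np.nan, 250000.0]})

# Series.mean() skips NaN by default, so this is the mean of the 3 observed values
col_mean = toy["salary"].mean()

# Replace every NaN with that mean
filled = toy["salary"].fillna(col_mean)

print(col_mean)
print(filled.tolist())

# Side effect: imputing with the mean shrinks the standard deviation
std_before = toy["salary"].std()  # computed on observed values only
std_after = filled.std()
print(std_before, std_after)
```

If the reduced spread matters for the model, the median or a model-based imputer are common alternatives to compare against.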

### Check the "salary" column to make sure it has no NaN values

In [0]:
df['salary']

In [0]:
df_nan = df[df['salary'].isna()]
len(df_nan)  # Should be 0 after imputation

### Now, we're going to encode the categorical data as binary columns (one-hot encoding)


In [0]:
df = pd.get_dummies(df, drop_first=True)

In [0]:
df.head()
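
As a rough illustration of what `pd.get_dummies(..., drop_first=True)` produces, here is a small sketch on a toy frame. The `gender` column and all values are assumptions made up for the example; only `mba_p` is a name mentioned above. Numeric columns pass through unchanged, while each categorical column becomes one indicator column per category, minus the first category.

```python
import pandas as pd

toy = pd.DataFrame({
    "gender": ["M", "F", "M"],    # hypothetical categorical column
    "mba_p": [58.8, 66.3, 57.8],  # numeric column, left untouched
})

# Categories are sorted ("F", "M"); drop_first=True drops "F",
# so a single indicator column "gender_M" remains.
encoded = pd.get_dummies(toy, drop_first=True)

print(list(encoded.columns))  # ['mba_p', 'gender_M']
print(encoded)
```

Dropping the first category avoids perfectly collinear indicator columns, which matters for a linear model with an intercept.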

### Now, we're going to build and train the regression model.

In [0]:
# First, split the data into "train" and "test" sets
X = df.drop('salary', axis=1)  # All columns except the target, "salary"
y = df['salary']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

# Show the results
print('The mean absolute error is: {}'.format(np.round(mean_absolute_error(y_test, y_pred), 2)))
print('The mean squared error is: {}'.format(np.round(mean_squared_error(y_test, y_pred), 2)))
print('The R2 score is: {}'.format(np.round(r2_score(y_test, y_pred), 2)))
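
To make the three metrics easier to interpret, here is a small sketch that computes them by hand with NumPy on toy numbers (made up for illustration, not model output). It follows the standard definitions: MAE is the mean of the absolute errors, MSE is the mean of the squared errors, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

# Toy actual and predicted values (made up)
y_true_toy = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_toy = np.array([2.0, 5.0, 8.0, 9.0])

errors = y_true_toy - y_pred_toy

mae = np.mean(np.abs(errors))  # mean absolute error
mse = np.mean(errors ** 2)     # mean squared error

ss_res = np.sum(errors ** 2)                             # residual sum of squares
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot                                 # coefficient of determination

print(mae, mse, r2)
```

Note that MSE is in squared salary units, so its square root (RMSE) is often easier to compare with MAE.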


### Plotting the results

In [0]:
%matplotlib inline
import matplotlib.pyplot as plt

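# Compare actual and predicted salaries for each test sample
plt.figure(figsize=(10, 8))
plt.plot(y_test.values, color='blue', label='Actual')
plt.plot(y_pred, color='red', label='Predicted')
plt.xlabel('Test sample index')
plt.ylabel('Salary')
plt.legend()
plt.show()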
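
As a possible second visualization, a scatter plot of actual against predicted values can be easier to read than two overlaid lines: points on the diagonal are perfect predictions. The sketch below is only one way to draw it. The helper name `plot_actual_vs_predicted` is hypothetical, and the numbers in the example call are made up.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_actual_vs_predicted(y_true, y_hat):
    """Scatter actual vs. predicted values with a y = x reference line."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)

    fig, ax = plt.subplots(figsize=(6, 6))
    ax.scatter(y_true, y_hat, alpha=0.7)

    # Diagonal spanning the full range of both arrays
    lo = min(y_true.min(), y_hat.min())
    hi = max(y_true.max(), y_hat.max())
    ax.plot([lo, hi], [lo, hi], color="red", linestyle="--",
            label="Perfect prediction")

    ax.set_xlabel("Actual salary")
    ax.set_ylabel("Predicted salary")
    ax.legend()
    return ax

# Example call with made-up numbers
ax = plot_actual_vs_predicted([200000, 250000, 300000],
                              [220000, 240000, 310000])
```

In the notebook, the same helper could be called as `plot_actual_vs_predicted(y_test, y_pred)`.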