# Hands-on session

# Linear regression

#### We will construct a linear model that explains the relationship between a car's mileage (mpg) and its other attributes. Here mpg is a continuous variable, and the dataset includes the target variable, which makes this a supervised learning problem.



## Import the necessary Libraries

In [None]:
import numpy as np   

import pandas as pd    
import matplotlib.pyplot as plt 
from scipy.stats import zscore
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
%matplotlib inline 
import seaborn as sns
from sklearn.model_selection import train_test_split

## Load the dataset  
Load the dataset using the pd.read_csv() command

In [None]:
cData = pd.read_csv("auto-mpg.csv")  
cData.shape

### Data Definition  

8 variables: 
- MPG (miles per gallon), 
- cylinders, 
- engine displacement (cu. inches), 
- horsepower,
- vehicle weight (lbs.), 
- time to accelerate from 0 to 60 mph (sec.),
- model year (modulo 100), and 
- origin of car (1. American, 2. European, 3. Japanese).
- Also provided are the car labels (types).
- Missing data values are marked by question marks.

## Exploratory Data analysis

### Exploring the data

In [None]:
cData.head()

### Print the shape and dimension of the data

In [None]:
print('The dimension of the data is:')
print(cData.shape)
print('The size of the data:', cData.size)
print('No of rows in the data:', cData.shape[0])
print('No of columns in the data:', cData.shape[1])

### Print the columns

In [None]:
cData.columns

### Get the number of unique observations in each column

In [None]:
cData.nunique()

#### Since the car name has no bearing on a car's mileage, we remove it

In [None]:
# dropping the 'car name' column
cData = cData.drop('car name', axis=1)

#### Replacing numbers with countries in origin column

In [None]:
# Also replacing the categorical var with actual values
cData['origin'] = cData['origin'].replace({1: 'america', 2: 'europe', 3: 'asia'})
cData.head()

### Visualization

Try to be as intuitive as possible and explore various combinations of plots. This is solely for better understanding and presentation of the data at hand.

#### Pair plot

In [None]:
sns.pairplot(cData,diag_kind='kde')

#### Count plots

In [None]:
sns.countplot(x='origin', data=cData)  # pass the column by keyword; recent seaborn releases require it

In [None]:
sns.countplot(x='cylinders', data=cData)

#### Visualizing the distribution of  target variable 'mpg' using various plots

In [None]:
sns.histplot(data=cData, x='mpg', kde=True)  # histplot replaces the deprecated distplot

In [None]:
sns.boxplot(x=cData['mpg'])

In [None]:
sns.scatterplot(x='mpg', y='weight', hue='origin', data=cData)

In [None]:
sns.scatterplot(x='mpg', y='weight', hue='cylinders', data=cData)

In [None]:
sns.swarmplot(x='origin', y='mpg', hue='cylinders', data=cData)

### Preprocessing the data

#### Create Dummy Variables (One hot encoding) 

Values like 'america' cannot be used in an equation. Using substitutes like 1 for america, 2 for europe and 3 for asia would imply that european cars fall exactly halfway between american and asian cars. We don't want to impose such a baseless assumption, so we perform one-hot encoding on the origin column.

In [None]:
cData = pd.get_dummies(cData, columns=['origin'])
cData.head()
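
To make the encoding step concrete, here is a minimal sketch on a toy frame. The values are made up for illustration, and `dtype=int` is an optional choice that shows the indicators as 0/1.

```python
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only).
toy = pd.DataFrame({"mpg": [18.0, 26.0, 31.0, 15.0],
                    "origin": ["america", "europe", "asia", "america"]})

# One indicator column per category; dtype=int gives 0/1 instead of True/False.
encoded = pd.get_dummies(toy, columns=["origin"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Each row has exactly one indicator set to 1, so no ordering between the origins is implied.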

In [None]:
cData.tail()

#### Deal with missing values

In [None]:
cData.describe().T

A quick summary of the data columns

#### Check if any values are missing

In [None]:
cData.isnull().sum()

In [None]:
cData.dtypes

In the cells above, horsepower is reported as an object column rather than a number, because its missing values are stored as question marks. As a result, isnull() does not detect them. 

Let's fix that by converting horsepower to a numeric format

In [None]:
cData['horsepower'] = pd.to_numeric(cData['horsepower'], errors='coerce')  # convert_objects was removed from pandas; '?' entries become NaN

In [None]:
cData.isnull().sum()

Now we see that horsepower has six missing values

#### Using median values to fill the missing values in that column

In [None]:
medianFiller = lambda x: x.fillna(x.median())
cData = cData.apply(medianFiller,axis=0)

In [None]:
cData.isnull().sum()

Now there are no missing values
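
The conversion and median fill above can be sketched on a toy column. The values below are invented for illustration, with '?' as the placeholder.

```python
import pandas as pd

# Toy column mimicking how '?' placeholders force a numeric column to be read as text.
hp = pd.Series(["130", "165", "?", "150", "?", "140"], name="horsepower")

# errors='coerce' converts anything that cannot be parsed into NaN.
hp_num = pd.to_numeric(hp, errors="coerce")
print(hp_num.isnull().sum())

# Fill the gaps with the median of the observed values.
hp_filled = hp_num.fillna(hp_num.median())
print(hp_filled.tolist())
```

The median is computed from the four observed values only, so the placeholders do not distort it.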

In [None]:
cData.columns

#### Normalization

In [None]:
cData2=cData[['cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'model year', 'origin_america', 'origin_asia',
       'origin_europe']].apply(zscore)
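
As a quick check of what zscore does, this sketch compares it with the manual formula on a few illustrative numbers (not taken from the dataset).

```python
import numpy as np
from scipy.stats import zscore

weights = np.array([3504.0, 3693.0, 3436.0, 3433.0, 3449.0])  # illustrative values

z_scipy = zscore(weights)
# Same thing by hand: subtract the mean, divide by the population std (ddof=0).
z_manual = (weights - weights.mean()) / weights.std(ddof=0)

print(np.allclose(z_scipy, z_manual))
print(z_scipy.mean(), z_scipy.std())
```

After the transformation the column has mean 0 and standard deviation 1, up to floating-point rounding.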

#### Checking correlation between attributes

In [None]:
plt.figure(figsize = (15,7))
plt.title('Correlation of Attributes', y=1.05, size=19)
sns.heatmap(cData.corr(), cmap='plasma',annot=True, fmt='.2f')

#### From the above plot the following can be inferred
- The correlation of 'acceleration' with every other attribute is less than 0.5
- The correlation of 'cylinders' with 'displacement' and 'weight' is high
- 'displacement' is highly correlated with 'weight'
- 'horsepower' is highly correlated with 'cylinders', 'weight' and 'displacement'

#### Normally, with many attributes, we would consider dropping variables that have low correlation with the other attributes; here, for example, acceleration has low correlation with the other variables. Since our dataset is small, it is fine not to drop a column

### Pair plot

Let's do the pair plot once again after one-hot encoding and filling the missing values 

In [None]:
sns.pairplot(cData)

In [None]:
cData.head()

## Building model and Evaluation

#### Separate the target variable from the rest of the attributes/columns

In [None]:
# independent variables
X = cData2
# the dependent variable
y = cData[['mpg']]

#### Using the function train_test_split, we separate the dataset into train and test data. The test_size argument specifies the fraction of the original dataset to hold out for testing. 
Here X, y are split into X_train, X_test, y_train, y_test, with 30% as the test size

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

### Create an instance of the linear regression.  
Import the LinearRegression class from the sklearn library. Then create an instance of it and save it to a variable

In [None]:
from sklearn.linear_model import LinearRegression
regression_model = LinearRegression()


### Use linear regression for training  
The linear regressor object has a built-in method for training called fit(). Pass the training data inside the parentheses

In [None]:
regression_model.fit(X_train, y_train)

### Now evaluate it. This gives us the R^2 score of our model.  
Similarly we use another in built function called score to get the R^2 score of our model

In [None]:
regression_model.score(X_train, y_train)
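
To show what the R^2 score measures, the sketch below computes it by hand and compares it with score(). The data is synthetic (an assumption for illustration), and the names X_demo, y_demo and demo_model are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y depends linearly on two features plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=100)

demo_model = LinearRegression().fit(X_demo, y_demo)
pred = demo_model.predict(X_demo)

ss_res = np.sum((y_demo - pred) ** 2)            # residual sum of squares
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)   # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, demo_model.score(X_demo, y_demo))
```

A value close to 1 means the model explains most of the variation in the target.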

### Now let's use our trained model on test data  
Use the predict command on the test data to get the predicted results.

In [None]:
results=regression_model.predict(X_test)
print(type(results))

### Comparing the predicted results with actual values  
Printing the first ten predicted results and the first ten test data labels

In [None]:
for i in range (10):
    print(results[i])
    

In [None]:
y_test.head(10)

### Evaluate the score for test data  
Using the same score function get the R^2 score for test data

In [None]:
regression_model.score(X_test, y_test)

#### As you can see, the score is fairly high and the predictions are fairly close to the actual labels. Surprisingly, the score on the test data is better than on the train data

# K-Means Clustering

#### Let's use the K means clustering algorithm on the same dataset

## Import the dataset and normalize it  
We are using the same dataset, so there is no need to load it again; just apply zscore normalization.

In [None]:
cData3=cData.apply(zscore)

## Find the number of K to initialize using Elbow method  
For K-means clustering, we must choose the number of clusters up front. We use the Elbow method for that.
In the code below we look for the optimal value of k among 1 through 9.

In [None]:
clusters=range(1,10)
meanDistortions=[]

for k in clusters:
    model=KMeans(n_clusters=k)
    model.fit(cData3)
    prediction=model.predict(cData3)
    meanDistortions.append(sum(np.min(cdist(cData3, model.cluster_centers_, 'euclidean'), axis=1)) / cData3.shape[0])


plt.plot(clusters, meanDistortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('Selecting k with the Elbow Method')


We can see a clear curve, and k = 4 seems to be the elbow point
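
To see what an elbow looks like when the true number of groups is known, here is a minimal sketch on synthetic data. The three cluster centres are chosen by hand, and `n_init` and `random_state` are set only to make the run repeatable.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (centres chosen for illustration).
points, _ = make_blobs(n_samples=300, centers=[(-5, -5), (0, 5), (5, -5)],
                       cluster_std=0.6, random_state=0)

distortions = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    # Mean distance from each point to its nearest centroid.
    nearest = np.min(cdist(points, km.cluster_centers_, "euclidean"), axis=1)
    distortions[k] = nearest.mean()

for k, d in distortions.items():
    print(k, round(d, 3))
```

The distortion drops sharply up to k = 3 and only slightly afterwards, which is the pattern to look for in the plot above.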

## Building Model.

### Create an instance of Kmeans  
Pass the number of clusters to it and fit it on the normalized data

In [None]:
final_model=KMeans(4)
final_model.fit(cData3)

### Use the trained algorithm to group the data

In [None]:
prediction=final_model.predict(cData3)

### Add the predictions to the original dataset
Load the original dataset, create a new column called group and put the predictions in it.

In [None]:
cData4= pd.read_csv("auto-mpg.csv")

In [None]:
cData4['group']=prediction

In [None]:
cData4.head(10)

In [None]:
cData4.tail(10)

### Find the mean of each attribute in each group  
First group the data by the group column.  
Then call mean() on the grouped object

In [None]:
Clust = cData4.groupby(['group'])

In [None]:
Clust.mean(numeric_only=True)  # skip text columns such as 'car name'

# Logistic Regression

## Load the dataset

Read the data using pd.read_csv function and save it to a variable

In [None]:
df = pd.read_csv("pima-indians-diabetes.csv")

## Splitting the labels from attributes/other columns  
We split the labels from the rest of the attributes and apply normalization for the attributes

In [None]:
X = df.drop('class',axis=1)    
Y = df['class']   
X=X.apply(zscore)


## Splitting the dataset into train and test  
The entire dataset is split using train_test_split function. 70% train and 30% test

In [None]:
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1)

### Number of people who have diabetes and who don't in the original, train and test data

In [None]:
print("Original Dataset Diabetes     : {0} ({1:0.2f}%)".format(len(df.loc[df['class'] == 1]), (len(df.loc[df['class'] == 1])/len(df.index)) * 100))
print("Original Dataset No Diabetes     : {0} ({1:0.2f}%)".format(len(df.loc[df['class'] == 0]), (len(df.loc[df['class'] == 0])/len(df.index)) * 100))
print("")
print("Training Dataset Diabetes    : {0} ({1:0.2f}%)".format(len(y_train[y_train[:] == 1]), (len(y_train[y_train[:] == 1])/len(y_train)) * 100))
print("Training Dataset No Diabetes   : {0} ({1:0.2f}%)".format(len(y_train[y_train[:] == 0]), (len(y_train[y_train[:] == 0])/len(y_train)) * 100))
print("")
print("Test Dataset Diabetes        : {0} ({1:0.2f}%)".format(len(y_test[y_test[:] == 1]), (len(y_test[y_test[:] == 1])/len(y_test)) * 100))
print("Test Dataset No Diabetes       : {0} ({1:0.2f}%)".format(len(y_test[y_test[:] == 0]), (len(y_test[y_test[:] == 0])/len(y_test)) * 100))
print("")
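
The same class breakdown can be obtained more compactly with value_counts. This is a sketch on made-up labels, not the actual dataset.

```python
import pandas as pd

# Toy labels standing in for the 'class' column (1 = diabetes, 0 = no diabetes).
labels = pd.Series([0, 1, 0, 0, 1, 0, 0, 1, 0, 0], name="class")

counts = labels.value_counts()                       # absolute counts per class
shares = labels.value_counts(normalize=True) * 100   # percentages per class

summary = pd.DataFrame({"count": counts, "percent": shares.round(2)})
print(summary)
```

Running the same two calls on y_train and y_test shows whether the split preserved the class balance.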

## Building model and evaluation
Import LogisticRegression from the sklearn module and create an instance of it. We also specify the solver we want; this is optional, and the default is used if none is given. Once that is done, fit the model on the training data.

### Training

In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver="liblinear")
model.fit(x_train, y_train)






### Use the model to predict on test data

In [None]:
y_predict = model.predict(x_test)

### Compare the actual test data and the predicted results

In [None]:
y_test.head(10)

In [None]:
for i in range(10):
    print(y_predict[i])

### Evaluation

#### Get the accuracy score

For a classifier, score() returns the mean accuracy rather than R^2.

In [None]:
model_score = model.score(x_test, y_test)
print(model_score)

#### Print the confusion matrix  
The sklearn library has a module called metrics with a built-in function for computing the confusion matrix, from which we can read the true positives, true negatives, false positives and false negatives.

In [None]:
from sklearn import metrics
cm=metrics.confusion_matrix(y_test, y_predict, labels=[1, 0])

df_cm = pd.DataFrame(cm, index = [i for i in ["1","0"]],
                  columns = [i for i in ["Predict 1","Predict 0"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True)

The confusion matrix

- True Positives (TP): we correctly predicted that they do have diabetes: 48
- True Negatives (TN): we correctly predicted that they don't have diabetes: 132
- False Positives (FP): we incorrectly predicted that they do have diabetes (a "Type I error"): 14
- False Negatives (FN): we incorrectly predicted that they don't have diabetes (a "Type II error"): 37
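
The four counts above are enough to work out the usual metrics by hand. The sketch below uses only those counts and the standard definitions.

```python
# Counts quoted in the text above.
tp, tn, fp, fn = 48, 132, 14, 37

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are correct
recall = tp / (tp + fn)                      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With these counts the recall is noticeably lower than the precision, because the false negatives outnumber the false positives.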

Sklearn has built-in functions for calculating various metrics like F1 score, accuracy, recall and precision

#### Accuracy

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
y_test=y_test.to_numpy() # optional: accuracy_score also accepts a pandas Series; here we convert it to a NumPy array

Using the accuracy score function we pass in the actual y values and predicted y values to get the accuracy score

In [None]:
accuracy_score(y_test, y_predict)

Similarly, we can calculate the other metrics using the respective built-in functions. We set average to 'binary' here because our label has only two classes. With more than two classes, use an option such as 'macro', 'micro' or 'weighted' instead.

#### Recall score

In [None]:
from sklearn.metrics import recall_score

In [None]:
recall_score(y_test, y_predict, average='binary')

#### Precision score

In [None]:
from sklearn.metrics import precision_score

In [None]:
precision_score(y_test, y_predict, average='binary')

#### F1 score

In [None]:
from sklearn.metrics import f1_score

In [None]:
f1_score(y_test, y_predict, average='binary')

# Support vector machines  
# With and Without Principal Component Analysis(PCA)

## Load the data  
Load the data into the notebook using pd.read_csv() function

In [None]:
df=pd.read_csv('Part3 - vehicle.csv')

## EDA

### Print the dimensions of the dataset

In [None]:
print('The size of the data:', df.size)
print('No of rows in the data:', df.shape[0])
print('No of columns in the data:', df.shape[1])

### Preprocess the data

#### Splitting the data attribute columns from the label column

In [None]:
x=df.drop('class',axis=1)
y=df[['class']]

#### Imputing missing values  
Converting non-numerical values to numerical, if present. Then checking for missing values and filling them with median values

In [None]:
x = x.apply(pd.to_numeric, errors='coerce')  # convert_objects was removed from pandas; unparseable entries become NaN

In [None]:
x.isnull().sum()

In [None]:
medianFiller = lambda x: x.fillna(x.median())
x = x.apply(medianFiller,axis=0)

In [None]:
x.isnull().sum()

In [None]:
y.isnull().sum()

#### Checking for zero values

In [None]:
(x==0).any()  # True for every column that contains at least one zero

In [None]:
(y==0).any()

Almost all the attributes are continuous

### Correlation Table

In [None]:
plt.figure(figsize = (15,7))
plt.title('Correlation of Attributes', y=1.05, size=19)
sns.heatmap(x.corr(), cmap='plasma',annot=True, fmt='.2f')

#### From the correlation table the attributes from 'compactness' to 'scaled_radius_of_gyration.1' have strong correlation with other values 

### Visualization

In [None]:
sns.pairplot(x, diag_kind='kde')

# SVM without PCA

## Splitting the dataset into train and test 

Normalize the data before splitting

In [None]:
x_scaled=x.apply(zscore)

In [None]:
xtrain, xtest, ytrain, ytest = train_test_split(x_scaled,y, test_size=0.30, random_state=1)

## Defining a custom accuracy function  
For every correct prediction a counter called correct increases by one step

In [None]:
def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        if testSet[x]== predictions[x]:
            correct += 1
    return (correct/float(len(testSet))) * 100.0
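
The same percentage can be computed without an explicit loop. The function below is a hypothetical vectorized equivalent, shown on made-up labels.

```python
import numpy as np

def get_accuracy_vectorized(actual, predicted):
    """Percentage of positions where the two label arrays agree."""
    actual = np.asarray(actual).ravel()        # flatten (n, 1) columns to 1-D
    predicted = np.asarray(predicted).ravel()
    return float(np.mean(actual == predicted) * 100.0)

# Made-up labels for illustration: 3 of 4 match.
demo_actual = np.array(["car", "bus", "van", "car"])
demo_pred = np.array(["car", "bus", "car", "car"])
print(get_accuracy_vectorized(demo_actual, demo_pred))
```

Flattening both inputs first means a single-column label array and a 1-D prediction array are compared element by element.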

## Building Model

### Initialize an instance of the svm function  
Import the svm module from the sklearn library and then use SVC (the classifier that uses support vector machines). Use the same hyperparameters as below

In [None]:
from sklearn import svm
clf = svm.SVC(gamma=0.025, C=3)    

### Training

#### Since we are going to compare the performance of the SVM algorithm with and without PCA, let's track how long each takes to run.  
Use the datetime library to get the current time. Measure the time taken to train and predict.

In [None]:
import datetime
t1=datetime.datetime.now()
clf.fit(xtrain , ytrain)
y_pred = clf.predict(xtest)
t2=datetime.datetime.now()

#### Time taken  
Print the time taken for the training process

In [None]:
print('Time Taken:', t2-t1)

In [None]:
ytest2 = ytest.to_numpy().ravel()  # flatten the (n, 1) label column into a 1-D array

### Accuracy of the model

In [None]:
print('Accuracy :', getAccuracy(ytest2 , y_pred))

# Applying PCA

In [None]:
covMatrix = np.cov(x_scaled,rowvar=False)

## Create an instance of PCA 
Import the PCA module from the sklearn library. Set n_components to the number of attribute columns (18 here)

In [None]:
from sklearn.decomposition import PCA, IncrementalPCA
pca = PCA(n_components=18)


## Get the number of optimal eigen components required  
First we fit the initialized PCA on all the attribute columns, and then we plot the principal components on the x axis against the variance explained on the y axis. We need the minimum number of components that explain 90% of the variation

In [None]:
pca.fit(x_scaled)
plt.bar(list(range(1,19)),pca.explained_variance_ratio_,alpha=0.5, align='center')
plt.ylabel('Variation explained')
plt.xlabel('Principal component')
plt.show()

### Cumulative plot of the above

In [None]:
plt.step(list(range(1,19)),np.cumsum(pca.explained_variance_ratio_), where='mid')
plt.ylabel('Cum of variation explained')
plt.xlabel('Principal component')
plt.show()

#### From the graph we can see that the first 10 principal components explain almost 100% of the variation, so we keep only 10 components. This reduces the 18 columns of the original dataset to 10 columns of PCA data 
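
Instead of reading the number of components off the plot, it can be computed from the cumulative ratios. This is a sketch on synthetic data (an assumption for illustration), using a 0.90 threshold.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the scaled features: 6 columns driven by 2 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
data = latent @ mixing + 0.05 * rng.normal(size=(200, 6))

cumulative = np.cumsum(PCA().fit(data).explained_variance_ratio_)
# First index where the cumulative ratio reaches 0.90, converted to a component count.
n_needed = int(np.argmax(cumulative >= 0.90) + 1)

# scikit-learn can make the same selection when n_components is a float in (0, 1).
auto_pca = PCA(n_components=0.90, svd_solver="full").fit(data)
print(n_needed, auto_pca.n_components_)
```

Passing a float to n_components keeps the threshold explicit in the code, which can be easier to adjust than a hard-coded component count.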

## Transform the input data using PCA  
PCA will transform the 18-column input data into 10 columns. Each new column is a linear combination of the original 18, and together the 10 columns retain almost all of the variation in the original data

In [None]:
pca3 = PCA(n_components=10)
pca3.fit(x_scaled)
#print(pca3.components_)
#print(pca3.explained_variance_ratio_)
xpca = pca3.transform(x_scaled)

### Pairplotting the new dataset

In [None]:
sns.pairplot(pd.DataFrame(xpca))


#### After applying PCA, the scatter plots show that the new variables are not correlated with one another, and the distributions on the diagonal look approximately normal. 
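
The lack of correlation between the transformed columns can also be checked numerically. The sketch below uses synthetic, strongly correlated features (an assumption for illustration).

```python
import numpy as np
from sklearn.decomposition import PCA

# Four columns that all follow the same underlying signal, plus a little noise.
rng = np.random.default_rng(1)
base = rng.normal(size=(300, 1))
features = np.hstack([base + 0.1 * rng.normal(size=(300, 1)) for _ in range(4)])

scores = PCA(n_components=3).fit_transform(features)

corr_before = np.corrcoef(features, rowvar=False)
corr_after = np.corrcoef(scores, rowvar=False)
off_diag = corr_after[~np.eye(3, dtype=bool)]   # correlations between different components

print(np.abs(corr_before).min())   # original columns are strongly correlated
print(np.abs(off_diag).max())      # component scores are uncorrelated up to rounding
```

The off-diagonal correlations of the component scores are zero up to floating-point error, which is what the scatter plots suggest visually.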

## Build model

### Splitting the dataset

In [None]:
xtrain, xtest, ytrain, ytest = train_test_split(xpca,y,test_size=0.30, random_state=1)

### Training and testing

In [None]:
t3=datetime.datetime.now()
clf.fit(xtrain , ytrain)
y_pred = clf.predict(xtest)
t4=datetime.datetime.now()

### Printing the time taken

In [None]:
print('Time Taken :',t4-t3)

### Get accuracy

In [None]:
print('Accuracy:',getAccuracy(ytest2 , y_pred))

## Conclusion

### Comparing the results, PCA let us reduce the dimensions from 18 to 10 without much effect on accuracy, and the SVM trained on the PCA data ran much faster. 

# Decision Tree as a regressor

## Import Libraries  

In [None]:
import numpy as np   
from sklearn.model_selection import train_test_split

import pandas as pd    
import matplotlib.pyplot as plt   
import seaborn as sns

## Load data  
Load the dataset using pd.read_csv()

In [None]:

mpg_df = pd.read_csv('auto-mpg.csv')  

## Data preprocessing

### Remove the insignificant columns

In [None]:
mpg_df.drop('car name',axis=1,inplace= True)
mpg_df.drop('model year',axis=1,inplace= True)


### Change numbers to countries and do one hot encoding

In [None]:
mpg_df['origin'] = mpg_df['origin'].replace({1: 'america', 2: 'europe', 3: 'asia'})
mpg_df = pd.get_dummies(mpg_df, columns=['origin'])
mpg_df.head()

### Fill missing values

In [None]:
mpg_df['horsepower'] = pd.to_numeric(mpg_df['horsepower'], errors='coerce')  # '?' entries become NaN
#mpg_df.isnull().sum()


In [None]:
medianFiller = lambda x: x.fillna(x.median())
mpg_df = mpg_df.apply(medianFiller,axis=0)
mpg_df.isnull().sum()

### Split the labels from attributes and create training and test data


In [None]:
X = mpg_df.copy(deep=True)
X.drop('mpg',axis=1, inplace= True)
y = mpg_df[['mpg']]  
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

## Build model  


### Training
Import the decision tree regressor (since this is a regression problem) and fit it on the training data. Use the same hyperparameters as below

In [None]:
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(random_state=0, max_depth=3)

regressor.fit(X_train , y_train)


### Testing

In [None]:
y_pred = regressor.predict(X_test)

### Evaluating using R^2 score

In [None]:
X_test=X_test.to_numpy()
y_test=y_test.to_numpy()

In [None]:
score = regressor.score(X_test, y_test)

In [None]:
print(score)

# Pipelines

## Import Libraries

In [None]:
from sklearn.datasets import load_breast_cancer 
from sklearn.model_selection import train_test_split 

## Load the dataset and split it

#### We are using the breast cancer dataset, which is readily available in sklearn datasets  
We load it into the notebook using the command load_breast_cancer() and then split the dataset into train and test sets

In [None]:
cancer = load_breast_cancer() 
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

## Create a pipeline
In the code below, we create a pipeline that normalizes the data with a min-max scaler and then runs SVM classification (SVC). We need to import those classes before adding them to the pipeline. We also give each step a name, written in quotes 

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC 
pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])

## Fit the pipeline on training data

In [None]:
pipe.fit( X_train, y_train)
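
To see what the pipeline does behind the scenes, this sketch compares it with the manual version of the same steps. The names demo_pipe, scaler and manual_svm are hypothetical and used only here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

data = load_breast_cancer()
Xtr, Xte, ytr, yte = train_test_split(data.data, data.target, random_state=0)

# Pipeline: the scaler and the classifier are fitted together on the training data.
demo_pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())]).fit(Xtr, ytr)

# Manual equivalent: fit the scaler on training data only, then reuse it on test data.
scaler = MinMaxScaler().fit(Xtr)
manual_svm = SVC().fit(scaler.transform(Xtr), ytr)

pipe_acc = demo_pipe.score(Xte, yte)
manual_acc = manual_svm.score(scaler.transform(Xte), yte)
print(pipe_acc == manual_acc)
```

The pipeline guarantees that the scaling learned from the training data is reused on the test data, which is easy to get wrong when the steps are written out by hand.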

## Accuracy on the test set   
Use the score function on the test dataset. For a classifier it returns the mean accuracy, not R^2

In [None]:
print("Test score: {:.2f}".format(pipe.score(X_test, y_test)))

## Get the predicted labels  
Use the predict function on X_test to get the predictions

In [None]:
y_pred = pipe.predict(X_test)

## Get the rest of metrics  
sklearn's metrics module has a single function called classification_report that prints the main metrics. It takes the test labels and the predicted labels

In [None]:
from sklearn import metrics

print(metrics.classification_report(y_test, y_pred))

# Hyper Parameter Tuning

## Preparing Dataset
Again we are using a dataset that is readily available in sklearn. We import the iris dataset using the command datasets.load_iris(). Then we separate the labels from the attributes and split the dataset into train and test sets. 

In [None]:
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data[:,2:]
y = iris.target
from sklearn.model_selection import train_test_split, cross_val_score

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y,random_state = 7)

## KNN  
We will be using the KNN algorithm for classification. To perform hyperparameter tuning it is necessary that one must know about the parameters each algorithm has. In this example we will tune only two parameters of the KNN method

## Parameter Tuning  
For parameter tuning, you need to create a dictionary with the model's parameter names as keys and the candidate values as values. For example, n_neighbors is a key whose value can be any positive integer, so we provide the list 1 to 8 (range(1,9) excludes 9). Similarly, KNN offers four algorithm options, which are listed under the algorithm key  

In [None]:

param_grid = {'n_neighbors': list(range(1,9)),
             'algorithm': ('auto', 'ball_tree', 'kd_tree' , 'brute') }
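
To see how many combinations such a dictionary produces, scikit-learn's ParameterGrid can enumerate them. This sketch repeats the grid under the hypothetical name demo_grid.

```python
from sklearn.model_selection import ParameterGrid

demo_grid = {"n_neighbors": list(range(1, 9)),
             "algorithm": ("auto", "ball_tree", "kd_tree", "brute")}

combos = list(ParameterGrid(demo_grid))
print(len(combos))   # 8 neighbour values x 4 algorithms
print(combos[0])
```

Each combination is fitted once per cross-validation fold, so the total number of fits grows quickly as keys or values are added.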

### Create a KNN instance  
Import KNN classifier from sklearn.neighbors and save it to a variable

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier()

### Create a Gridsearch instance
Import GridSearchCV from sklearn, create an instance of it and save it to a variable. Pass the estimator (the variable holding the machine learning algorithm), then the parameter grid. The third argument, cv (the number of cross-validation folds), is optional.

In [None]:
from sklearn.model_selection import GridSearchCV
gs = GridSearchCV(knn_clf,param_grid,cv=10)

### Fit on training  data

In [None]:
gs.fit(X_train, y_train)

### Get the best parameters  
Use the best_params_ attribute to get the best parameters from the options we provided

In [None]:
gs.best_params_

### Get all the parameter combinations that were tried

In [None]:
gs.cv_results_['params']

### Print the scores for each combination

In [None]:
gs.cv_results_['mean_test_score']

## Build model  using best parameters
Here we create a new KNN classifier and save it to a variable. Inside the parentheses, pass the parameters with their best values

In [None]:
knn_clf2 = KNeighborsClassifier(algorithm='brute',n_neighbors=1)

### Train the algorithm, test it and evaluate it

In [None]:
knn_clf2.fit(X_train, y_train)

In [None]:
y_predict = knn_clf2.predict(X_test)

In [None]:
knn_clf2.score(X_test,y_test)
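
As an alternative to typing the best values in by hand, the fitted search object already holds a refitted model. This is a self-contained sketch with a smaller grid; the names search and best_knn are hypothetical.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_demo = datasets.load_iris()
Xd, yd = iris_demo.data[:, 2:], iris_demo.target
Xd_train, Xd_test, yd_train, yd_test = train_test_split(Xd, yd, stratify=yd, random_state=7)

search = GridSearchCV(KNeighborsClassifier(),
                      {"n_neighbors": list(range(1, 9))},
                      cv=10)
search.fit(Xd_train, yd_train)

# With the default refit=True, the best combination is refitted on all training data.
best_knn = search.best_estimator_
print(search.best_params_)
print(best_knn.score(Xd_test, yd_test))
```

Using best_estimator_ avoids copying parameter values manually, so the final model stays in sync if the grid changes.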

# Try it yourself

You will be provided a dataset. The objective is to find whether a client is eligible for a loan or not based on the other columns (attributes). Since this is a binary classification problem, use logistic regression

Q1: Import the necessary libraries  
You might need: pandas, numpy, seaborn, matplotlib.pyplot, sklearn, scipy

Q2: Load the data into the notebook using pd.read_csv(file.csv)

Q3: Remove the following columns  
- Experience
- Id
- Zip code

To drop multiple columns use df.drop(['col1','col2'], axis=1, inplace=True)

Q4: Take any one dataset for now and perform EDA; try to use the following functions  
- df.mean,
- df.mode,
- df.median,
- df.describe, 
- df.quantile(q=0.25), 
- df.corr()
- df.cov()
- df.info,
- df.nunique,
- df['col'].value_counts()


Q5: Plot various columns, try to use the following commands  
- sns.histplot() (replaces the deprecated sns.distplot())
- sns.pairplot()
- sns.scatterplot()
- sns.jointplot()
- sns.countplot()
- sns.boxplot()

Q6: Try to preprocess the data   
- Convert class object to numerical object
- Removing missing values or imputing with median values
- One hot encoding if needed

Q7: Separate the target variable (Personal Loan) from the rest of the columns and split into training and test data  
Use train_test_split() command  
- X= attributes
- Y= target column
- Perform normalization on all columns of X before splitting into train and test

Q8: Create an instance of logistic Regression and fit it on training data

Q9: Now use the trained model to predict by using predict() function on test data

Q10: Get the confusion matrix, print accuracy, recall, f1_score and precision