Mini project: a classification problem solved using different classification algorithms. Problem statement:
KNN falls in the supervised learning family of algorithms. Informally, this means that we are given a labelled dataset consisting of training observations (x,y) and would like to capture the relationship between x and y. More formally, our goal is to learn a function h:X→Y so that, given an unseen observation x, h(x) can confidently predict the corresponding output y.
In this module we will explore the inner workings of KNN, choose the optimal value of K, and use the KNN implementation from scikit-learn.
The data set we’ll be using is the Iris Flower Dataset which was first introduced in 1936 by the famous statistician Ronald Fisher and consists of 50 observations from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals.
Source: https://archive.ics.uci.edu/ml/datasets/Iris
Train the KNN algorithm to be able to distinguish the species from one another given the measurements of the 4 features.
*It's not always good to remove records having missing values; we may end up losing useful data points. Instead, we will see how to replace those missing values with an estimated value (the median).*
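Median imputation can be sketched as below. The column name and the tiny frame are hypothetical stand-ins for the Iris data loaded from the CSV:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value; the real data comes from the Iris CSV.
df = pd.DataFrame({"sepal_length": [5.1, np.nan, 4.7, 5.0]})

# Replace missing values with the column median instead of dropping the rows.
df["sepal_length"] = df["sepal_length"].fillna(df["sepal_length"].median())
```

With a full dataframe, the same idea is `df.fillna(df.median(numeric_only=True))`.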
Change all the classes to numeric labels (0 to 2).
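One way to encode the three species as 0-2 is an explicit mapping (the species strings below follow the UCI file's naming; adjust if your file differs):

```python
import pandas as pd

# Hypothetical target column; the real file holds the three Iris species names.
y = pd.Series(["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"])

# Map each class name to an integer code in 0..2.
mapping = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}
y_num = y.map(mapping)
```

`sklearn.preprocessing.LabelEncoder` achieves the same result without writing the mapping by hand.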
Observe the association of each independent variable with the target variable, and drop from the feature set any variable whose correlation with the target lies in the range -0.1 to 0.1.
Observe the variance of the independent variables and drop those with zero or near-zero variance (variance < 0.1). They will have almost no influence on the classification.
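Both filters above can be sketched together. This uses scikit-learn's built-in copy of the Iris data as a stand-in for the UCI CSV; the thresholds are the ones stated in the instructions:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame  # four feature columns plus a numeric 'target' column

# Correlation filter: drop features with |corr(feature, target)| < 0.1.
corr = df.corr()["target"].drop("target")
low_corr = corr[corr.abs() < 0.1].index.tolist()

# Variance filter: drop features with near-zero variance (< 0.1).
variances = df.drop(columns=["target"]).var()
low_var = variances[variances < 0.1].index.tolist()

df = df.drop(columns=list(set(low_corr + low_var)))
```

On the Iris data no column actually falls below either threshold, so all four features survive; the pattern matters more than the outcome here.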
Plot the scatter matrix for all the variables.
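pandas ships a scatter-matrix helper that covers this step directly (again using the built-in Iris copy as a stand-in):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs headless
from pandas.plotting import scatter_matrix
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)

# One panel per feature pair, histograms on the diagonal.
axes = scatter_matrix(iris.data, figsize=(8, 8), diagonal="hist")
```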
Split the dataset into training and test sets with 80-20 ratio.
Build the model with scikit-learn, train it on the training set, and evaluate it on the test set. Print the accuracy of the model for k = 3, 5, 9.
Hint: For accuracy you can check accuracy_score() in scikit-learn
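A sketch of this step, using the same 80-20 split with `random_state=22` as in the snippet below and scikit-learn's built-in Iris copy in place of the CSV:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=22)

accuracies = {}
for k in (3, 5, 9):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    accuracies[k] = accuracy_score(y_test, knn.predict(X_test))
    print(f"k={k}: accuracy={accuracies[k]:.3f}")
```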
Run KNN with the number of neighbours set to 1, 3, 5, ..., 19 and find the optimal number of neighbours from this list using the misclassification error.
Hint:
Misclassification error = 1 - test accuracy score. Calculate the misclassification error for each model with neighbours = 1, 3, 5, ..., 19 and find the model with the lowest error.
Plot misclassification error vs k (with the k value on the X-axis) using matplotlib.
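The k-sweep and the plot together might look like this (built-in Iris copy, same split parameters as below):

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=22)

ks = list(range(1, 20, 2))  # 1, 3, 5, ..., 19
errors = []
for k in ks:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    errors.append(1 - knn.score(X_test, y_test))  # misclassification error

best_k = ks[errors.index(min(errors))]

plt.plot(ks, errors, marker="o")
plt.xlabel("k (number of neighbours)")
plt.ylabel("Misclassification error")
plt.title("Misclassification error vs k")
```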
#Load all required libraries
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=22)
Dear Participant,
Parkinson’s Disease (PD) is a degenerative neurological disorder marked by decreased dopamine levels in the brain. It manifests itself through a deterioration of movement, including the presence of tremors and stiffness. There is commonly a marked effect on speech, including dysarthria (difficulty articulating sounds), hypophonia (lowered volume), and monotone (reduced pitch range). Additionally, cognitive impairments and changes in mood can occur, and risk of dementia is increased.
Traditional diagnosis of Parkinson’s Disease involves a clinician taking a neurological history of the patient and observing motor skills in various situations. Since there is no definitive laboratory test to diagnose PD, diagnosis is often difficult, particularly in the early stages when motor effects are not yet severe. Monitoring progression of the disease over time requires repeated clinic visits by the patient. An effective screening process, particularly one that doesn’t require a clinic visit, would be beneficial. Since PD patients exhibit characteristic vocal features, voice recordings are a useful and non-invasive tool for diagnosis. If machine learning algorithms could be applied to a voice recording dataset to accurately diagnosis PD, this would be an effective screening step prior to an appointment with a clinician.
Use the provided dataset in order to do your analysis.
#Attribute Information:
#Matrix column entries (attributes):
#name - ASCII subject name and recording number
#MDVP:Fo(Hz) - Average vocal fundamental frequency
#MDVP:Fhi(Hz) - Maximum vocal fundamental frequency
#MDVP:Flo(Hz) - Minimum vocal fundamental frequency
#MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP - Several measures of variation in fundamental frequency
#MDVP:Shimmer,MDVP:Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,MDVP:APQ,Shimmer:DDA - Several measures of variation in amplitude
#NHR,HNR - Two measures of ratio of noise to tonal components in the voice
#status - Health status of the subject: (one) - Parkinson's, (zero) - healthy
#RPDE,D2 - Two nonlinear dynamical complexity measures
#DFA - Signal fractal scaling exponent
#spread1,spread2,PPE - Three nonlinear measures of fundamental frequency variation
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
Q3. Check for class imbalance. Do people with Parkinson's have greater representation in the dataset?
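A class-imbalance check reduces to counting the `status` column. The counts below are an illustrative stand-in for the real dataframe (the published UCI file has 147 Parkinson's and 48 healthy recordings, so yes, Parkinson's is over-represented):

```python
import pandas as pd

# Stand-in for the Parkinson's dataframe: 'status' is 1 = Parkinson's, 0 = healthy.
data = pd.DataFrame({"status": [1] * 147 + [0] * 48})

counts = data["status"].value_counts()
print(counts)

# A simple imbalance flag: majority class more than 1.5x the minority class.
imbalanced = counts.max() / counts.min() > 1.5
```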
Q5. Plot the distribution of all the features. State any observations you can make based on the distribution plots.
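`DataFrame.hist` plots one histogram per numeric column in a single call. The two columns below are synthetic stand-ins; with the real data you would simply call `data.hist(figsize=(16, 12))`:

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend
import numpy as np
import pandas as pd

# Synthetic stand-in frame with two of the dataset's column names.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "MDVP:Fo(Hz)": rng.normal(154, 41, 195),
    "HNR": rng.normal(22, 4, 195),
})

# One histogram panel per column.
axes = data.hist(figsize=(8, 4), bins=20)
```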
Q10. Use the regularization parameters max_depth and min_samples_leaf to recreate the model. What is the impact on the model accuracy? How does regularization help?
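The comparison can be sketched as follows. Since the Parkinson's file isn't bundled with scikit-learn, this uses the built-in breast-cancer data as a stand-in binary-classification problem; the regularization values are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree grows until it fits the training set perfectly (overfitting).
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Regularized tree: cap the depth and require a minimum number of samples per leaf.
reg = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5,
                             random_state=0).fit(X_train, y_train)

print("full tree   train/test:", full.score(X_train, y_train), full.score(X_test, y_test))
print("regularized train/test:", reg.score(X_train, y_train), reg.score(X_test, y_test))
```

Typically the regularized tree loses a little training accuracy but generalizes as well as or better than the full tree.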
Q11. Implement a Random Forest model. What is the optimal number of trees that gives the best result?
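Finding the best tree count is a sweep over `n_estimators`. Again the breast-cancer data stands in for the Parkinson's file, and the candidate counts are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit one forest per candidate tree count and record test accuracy.
scores = {}
for n in (10, 50, 100, 200):
    rf = RandomForestClassifier(n_estimators=n, random_state=0).fit(X_train, y_train)
    scores[n] = rf.score(X_test, y_test)

best_n = max(scores, key=scores.get)
print(scores, "-> best n_estimators:", best_n)
```

For a more robust answer than a single split, the same sweep can be wrapped in `sklearn.model_selection.cross_val_score` or `GridSearchCV`.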
German Credit
Estimate default probabilities using logistic regression
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
data = pd.read_excel('GermanCredit.XLSX')
data.shape
sns.histplot(data.CreditAmount)
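Estimating default probabilities with logistic regression comes down to `predict_proba`. Since the GermanCredit.XLSX file isn't reproduced here, the frame below is a synthetic stand-in, and the `Default` target column plus both feature names are assumptions about the real file's layout:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the German Credit frame; 'Default' is an assumed
# 0/1 target column, 'CreditAmount' and 'Duration' assumed feature columns.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "CreditAmount": rng.normal(3000, 1500, 500).clip(250),
    "Duration": rng.integers(6, 60, 500),
})
# Make default odds rise with loan duration so the model has signal to find.
demo["Default"] = (rng.random(500) <
                   1 / (1 + np.exp(-(demo["Duration"] - 30) / 10))).astype(int)

X = demo[["CreditAmount", "Duration"]]
y = demo["Default"]

model = LogisticRegression(max_iter=1000).fit(X, y)

# predict_proba returns [P(class 0), P(class 1)] per row; column 1 is the
# estimated default probability.
default_prob = model.predict_proba(X)[:, 1]
```

With the real data, the same pattern applies after selecting the actual feature columns and target from the loaded dataframe.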